Cybersecurity & Tech

Digital Watermarks Are Not Ready for Large Language Models

Bob Gleichauf, Dan Geer
Thursday, February 29, 2024, 8:00 AM
We should pay attention to lessons learned in cybersecurity before adopting digital watermarks in pursuit of GenAI safety.
"Blue futuristic networking technology vector" (Kappy, https://tinyurl.com/2xw7rfux; CC0 1.0 DEED, https://creativecommons.org/publicdomain/zero/1.0/)

Published by The Lawfare Institute in Cooperation With Brookings

It’s been about a year since ChatGPT reset our understanding of what artificial intelligence can do. The capabilities of generative artificial intelligence (GenAI) and large language models (LLMs) are stunning. But accompanying these capabilities is a rapidly lengthening list of concerns—including privacy, copyright protection, trustworthiness, and other matters. Lessons learned from cybersecurity indicate that recent proposals to use digital watermarks to address some of these concerns, while seemingly well intentioned, are unlikely to achieve their supporters’ goals. 

The crux of the issue is that while LLM technology is showing signs of planning and reasoning that go beyond mere autocompletion, LLMs are still unable to be introspective: they do not consider motivation or how results are generated. In system dynamics terms, they have no state representation of the world, and therefore they cannot police themselves. A range of proposals has been put forward for how we might improve LLM “safety” (the EU AI Act uses the term “safety” 114 times), beginning with something as seemingly basic as the ability to distinguish whether a text artifact was authored by a model or by a human being.

One proposal being promoted by LLM vendors Amazon, Anthropic, Google, Microsoft, and OpenAI—among others—involves embedding hidden, permanent, immutable “watermarks” in LLM-generated content to document its origin, or “provenance.” The idea of creating a text equivalent of the physical watermarks used for paper currency or the digital watermarks used for image copyright does sound intriguing. But history shows us that the cost of circumventing these watermarks can actually be quite low, such as when high-quality photocopiers became available. In our opinion, digital watermarks will offer even less benefit when applied to LLMs for reasons that range from how digital watermarks operate, to their economics, to, most important of all, their resiliency.

We are deeply sympathetic to the difficulty of making sensible policy for fast-moving targets like GenAI, but our first message here is this: When it comes to mandating safety mechanisms for LLMs, do not make hard-to-reverse choices yet. But choices are already being made, such as the recent language in the National Defense Authorization Act (NDAA) for Fiscal Year 2024 that incentivizes the creation of digital-watermarking tools. Our second message is that when it comes to thinking about GenAI safety in general, and the potential role of digital watermarks in LLMs in particular, look to the cybersecurity realm, which has long dealt with problems that parallel those now facing this new technology area. The questions posed in the section headings below are a sample of how you might evaluate these issues through a cybersecurity lens.

“Who Are You? Who, Who, Who, Who?”

Those lyrics from a classic song by legendary British rock band The Who get to the heart of basic cybersecurity, which begins with authentication. While we have a fair understanding of how best to authenticate a person, or a proxy for a person such as a sensor or a service, we have yet to settle on how best to prove the provenance of a digital artifact in the form of text, which is easier to cut and paste than other forms of content. A person may know something, such as a secret password; may have something, such as a non-inspectable device that holds a secret, like a cell phone or a USB token; or may be something, such as displaying a previously recorded fingerprint or iris scan. (Any two of these can be combined for a two-factor authentication scheme.) A text object, in contrast, doesn’t know anything and doesn’t have anything; it only is something. But could that something be self-authenticating like the fingerprint or iris scan?

Not directly. Fingerprints and iris scans work because I cannot steal yours, nor can I change mine. This is what policymakers and vendors are looking for with watermarks: a technique for putting proof of origin in the text itself. But text is malleable, and something inserted into it, or removed from it, is observable by all. To be self-authenticating, the watermark must be hidden in the text, which is to say it must be hidden in plain view. To serve as proof, it must be hidden proof that can nevertheless be verified later under suitable conditions but, by virtue of being hidden, can neither be removed nor modified; it is immutable. In short, a digital watermark is data embedded inside a body of text such that the embedding is not obvious and has no deleterious effect on the usability of the text in which it is placed. This is an ambitious goal.

To use cybersecurity terminology, watermarking is a “steganographic” mechanism. Whereas cryptographic messages are visible and their content is protected by a (secret) key, steganographic messages are hidden and their existence, and thus their content, is protected by a (secret) algorithm. Most watermark makers would want to communicate provenance information such as a globally unique identifier (GUID), time of creation, authorship, and the like. Because this communication must be not only unreadable by others (as with encryption) but also undetectable by them, the secret algorithm is needed both to find and to examine digital watermarks.
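
To make the mechanism concrete, here is a minimal sketch, in Python, of a keyed steganographic text watermark: a short provenance payload is hidden by swapping ordinary spaces for a visually similar Unicode space character, whitened with a secret key so it cannot be read, or even located, without that secret. Everything in it (the payload format, the key, the helper names) is illustrative; published LLM watermarking proposals instead bias the model’s choice of tokens during generation, but the trust properties discussed below are the same.

```python
# Illustrative only: a toy steganographic watermark that hides a short
# provenance payload in the spaces of a text by substituting a visually
# similar Unicode space character. Real LLM watermarking research biases
# token choices instead; this sketch only shows the general shape:
# hidden data plus a secret needed to find and read it.
import hmac, hashlib

ZERO = "\u0020"   # ordinary space encodes a 0 bit
ONE  = "\u00a0"   # no-break space encodes a 1 bit (looks the same in most fonts)

def _keystream(key: bytes, n: int) -> list[int]:
    """Keyed bit stream used to whiten the payload so the marks cannot be
    interpreted without the secret key."""
    out, counter = [], 0
    while len(out) < n:
        block = hmac.new(key, counter.to_bytes(4, "big"), hashlib.sha256).digest()
        out.extend((byte >> i) & 1 for byte in block for i in range(8))
        counter += 1
    return out[:n]

def embed(text: str, payload: bytes, key: bytes) -> str:
    bits = [(b >> i) & 1 for b in payload for i in range(8)]
    bits = [b ^ k for b, k in zip(bits, _keystream(key, len(bits)))]
    spaces = [i for i, ch in enumerate(text) if ch == " "]
    if len(spaces) < len(bits):
        raise ValueError("text too short to carry the payload")
    chars = list(text)
    for pos, bit in zip(spaces, bits):
        chars[pos] = ONE if bit else ZERO
    return "".join(chars)

def extract(text: str, payload_len: int, key: bytes) -> bytes:
    marks = [ch for ch in text if ch in (ZERO, ONE)]
    bits = [1 if ch == ONE else 0 for ch in marks[: payload_len * 8]]
    bits = [b ^ k for b, k in zip(bits, _keystream(key, len(bits)))]
    return bytes(
        sum(bits[i + j] << j for j in range(8)) for i in range(0, len(bits), 8)
    )

marked = embed(
    "Large language models can draft fluent prose on demand, which is "
    "precisely why policymakers want a reliable way to tell machine text "
    "from human text.",
    payload=b"v1",            # stand-in for a real provenance record
    key=b"illustrative-key",  # the steganographic secret
)
assert extract(marked, payload_len=2, key=b"illustrative-key") == b"v1"
```

Even this toy runs out of carrying capacity quickly on short text, a small preview of the robustness-versus-length problem discussed below.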

Those are the first-order principles of digital watermarking. The cyber principles discussed next help illustrate why watermarking cannot effectively substantiate the origin of LLM content.

Which Threats Matter?

In cybersecurity, secure-by-design protection begins with an explicit threat model—a detailed assessment of your adversaries’ capabilities along with the threshold of difficulty those adversaries must overcome to achieve their goals. You effectively choose the types of failure you can tolerate and provide countermeasures for those you cannot. A clear LLM threat model must be articulated before we can properly assess the utility of watermarks.
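
As a purely illustrative sketch of what “explicit” means in practice, a threat model can be written down as structured data: each entry names an adversary, its capabilities, the cost it must bear, and whether the defender counters or tolerates that failure. The adversaries and judgments below are hypothetical, not a proposed standard.

```python
# Hypothetical, minimal encoding of a threat model for LLM watermarking.
# The adversaries, capabilities, and tolerances are illustrative only; the
# point is that every claim about the defense is tied to a named attacker
# and an explicit decision to counter or tolerate the corresponding failure.
from dataclasses import dataclass

@dataclass
class Threat:
    adversary: str           # who is attacking
    capabilities: list[str]  # what they can do
    attack_cost: str         # rough threshold they must overcome
    countered: bool          # do we defend, or tolerate this failure?
    rationale: str = ""

WATERMARK_THREAT_MODEL = [
    Threat(
        adversary="casual user passing off LLM text as their own",
        capabilities=["light manual paraphrasing"],
        attack_cost="minutes of effort",
        countered=True,
        rationale="robust watermarks are claimed to survive light edits",
    ),
    Threat(
        adversary="motivated actor with open access to a detector",
        capabilities=["automated paraphrase-and-test loops"],
        attack_cost="compute plus repeated detector queries",
        countered=False,
        rationale="oracle attacks are tolerated absent rate limits",
    ),
    Threat(
        adversary="nation-state forging an adversary's watermark",
        capabilities=["steal or reverse-engineer the secret algorithm"],
        attack_cost="espionage-level resources",
        countered=False,
        rationale="out of scope for a voluntary vendor scheme",
    ),
]

for t in WATERMARK_THREAT_MODEL:
    status = "countered" if t.countered else "tolerated"
    print(f"{t.adversary}: {status} ({t.attack_cost})")
```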

A clear threat model has not been crafted for LLMs, though some pronouncements have implied potential requirements for such a model. The fact sheet for the Oct. 30, 2023, Executive Order on Safe, Secure, and Trustworthy Artificial Intelligence declares: “The Department of Commerce will develop guidance for content authentication and watermarking to clearly label AI-generated content.” It also states that “[f]ederal agencies will use these tools to make it easy for Americans to know that the communications they receive from their government are authentic.” These statements suggest the threat envisioned here is primarily that of American citizens being misled in some way by LLM-generated government communications and less about differentiating LLM-generated content from human-generated content in such communications. 

There is some focus (including in the full executive order) on that differentiation issue, but without a sharper definition of a threat model, it’s hard to gauge how effective watermarks—or any other approach to proving provenance for that matter—will be. This is one reason for our earlier exhortation to not make hard-to-reverse choices about LLM assurance tools just yet.

Whom Do You Trust—and for How Long?

Trust in a system, or in anything else, must either be assumed (as a given) or derived (extended from something already trusted). Put differently, assurance systems have a single trust anchor, a system component for which trust is assumed rather than derived and from which trust can be extended to all subordinate artifacts created under that domain of trust. For subordinate components, trust takes the form of “If you trust A (the trust anchor in this example), then under these circumstances you can trust B” or “B trusts A and C trusts B; therefore, C trusts A.” In public-key cryptographic systems, for example, that anchor is the topmost, or “root,” key along with the issuing authority that holds it. That root key signs subordinate keys, and if you trust the root key, then you can trust the subordinate keys the root key has signed. In most cases, the root key is kept offline and the other keys underneath it form a hierarchy. If one of these subordinate keys is compromised, the ramifications of that compromise extend only to that one entity and its subordinate entities farther down in the hierarchy. In practice, the effectiveness and resilience of those public-key systems are based on maintaining the secrecy of keys and the resistance of the encryption algorithm to brute-force or mathematical attacks.
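
That hierarchical pattern can be sketched in a few lines using Ed25519 signatures from the third-party `cryptography` package (an assumed dependency); the two-level hierarchy and key names are illustrative rather than any particular certificate authority’s design.

```python
# Illustrative chain of trust: an offline "root" key signs an intermediate
# key, which signs an end-entity key. Trusting the root (the trust anchor)
# lets a verifier derive trust in everything below it; compromising one
# intermediate affects only its own subtree, not its siblings.
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.hazmat.primitives.serialization import Encoding, PublicFormat
from cryptography.exceptions import InvalidSignature

root_key = Ed25519PrivateKey.generate()          # the trust anchor; kept offline in practice
intermediate_key = Ed25519PrivateKey.generate()
leaf_key = Ed25519PrivateKey.generate()          # e.g., a key that signs day-to-day artifacts

def raw_public(key) -> bytes:
    """Raw public-key bytes, standing in here for a full certificate."""
    return key.public_key().public_bytes(Encoding.Raw, PublicFormat.Raw)

# Each link in the hierarchy is "parent signs the child's public key".
cert_intermediate = root_key.sign(raw_public(intermediate_key))
cert_leaf = intermediate_key.sign(raw_public(leaf_key))

def chain_is_trusted() -> bool:
    """If you trust the root, you can derive trust in every subordinate key."""
    try:
        root_key.public_key().verify(cert_intermediate, raw_public(intermediate_key))
        intermediate_key.public_key().verify(cert_leaf, raw_public(leaf_key))
        return True
    except InvalidSignature:
        return False

print(chain_is_trusted())   # True while every signature in the chain checks out
```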

Steganographic systems also have a trust anchor in the form of a secret algorithm as well as the entity that holds it, but that trust is based on a shared secret (the algorithm) and, therefore, is not hierarchical in nature. So any compromise of the steganographic algorithm will negate the trustworthiness of all the watermarks previously issued by that authority. This is a significant concern: Whether speaking of human history or cybersecurity history, it is difficult to regain trust in an authority once it has been breached. Implementing (steganographic) watermarks would mean that some problems, especially those related to resilience, that are already hard in cryptographic systems become even harder in this context.

It is, of course, advantageous for consumers of documents to be able to verify that those documents have not been altered after they were produced. This is already a solved problem so long as you are willing to have that proof visible in the form of a digitally signed digest of the content. (A digest is a short summary of a document that is, like an identification number, unique but not substantively meaningful.) But a visible signature applied to a digest of the content works only if you tie the assurance provided by the cryptographic signature to a public-key cryptosystem (because, otherwise, why trust the key making the signature?). Note that because a visible signature attached to a document can be removed, documents never signed and documents whose signature was removed become indistinguishable. Making the watermark undetectable, and therefore irremovable, is a necessary part of the watermark’s permanence.
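
The visible, signed-digest approach mentioned above is easy to sketch concretely (again an illustrative Ed25519 example using the `cryptography` package): hash the content, sign the hash, and verify later. It also makes the removability problem plain, because the signature is a separate artifact that anyone can simply discard.

```python
# Visible integrity proof: a digitally signed digest of the content.
# Verification fails if the text changes; but the signature is detachable,
# so an unsigned copy is indistinguishable from a document that was never
# signed in the first place.
import hashlib
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

signer = Ed25519PrivateKey.generate()
document = b"Agency guidance, version 1.0 ..."

digest = hashlib.sha256(document).digest()   # short, unique summary of the content
signature = signer.sign(digest)              # visible, detachable proof

def verify(doc: bytes, sig: bytes) -> bool:
    try:
        signer.public_key().verify(sig, hashlib.sha256(doc).digest())
        return True
    except InvalidSignature:
        return False

print(verify(document, signature))                  # True
print(verify(document + b" (edited)", signature))   # False: content was altered
```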

Besides the requirement for watermarks to be undetectable (and thus irremovable), there is also the requirement that they remain immutable once they are applied. While experimental evidence shows text-based watermarks can be designed to withstand a variety of attacks, such as paraphrasing and cut-and-paste, their effectiveness (referred to by the researchers as “robustness”) is a function of text length. Text snippets under 1,000 words become more of a challenge, with effectiveness dropping steadily as the text shrinks in size. Other researchers have shown that watermarks can be filtered out of derivative LLM content by recursively inserting LLM output back into their models’ training data, leading to conditions sometimes referred to as model “collapse” or “poisoning.” In short, there is a classic cybersecurity dual-use problem here: watermark-detection tools, the essential component for verifying provenance claims, can be run by bad actors, too, who would repeatedly alter and test the text looking for a version to their liking that still passes the watermark test. In this manner, attackers can defeat a watermark’s integrity guarantee as well as its effectiveness as proof of provenance; it is only a question of attacker persistence.
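
A sketch of that dual-use dynamic: given unrestricted access to a detection tool, an automated loop can keep rewriting and re-testing text until it finds a version that suits the attacker, whether that means stripping the mark or keeping it while the meaning changes. Both `detect_watermark` and `paraphrase` below are hypothetical stand-ins, not real services.

```python
# Hypothetical sketch of an "oracle" attack against a watermark detector.
# detect_watermark() and paraphrase() stand in for a real detection service
# and a real rewriting model; the point is only that unlimited access to
# the detector lets an attacker search for text that defeats it.
import random

def detect_watermark(text: str) -> bool:
    """Placeholder for a vendor's detection tool (assumed, not real)."""
    return "\u00a0" in text   # toy rule standing in for a real detector

def paraphrase(text: str, rng: random.Random) -> str:
    """Placeholder for an LLM-based rewriter; here it just perturbs spaces."""
    return "".join(
        " " if ch == "\u00a0" and rng.random() < 0.3 else ch for ch in text
    )

def scrub(text: str, max_queries: int = 1000, seed: int = 0) -> str | None:
    """Repeatedly rewrite and re-test until the detector no longer fires."""
    rng = random.Random(seed)
    candidate = text
    for _ in range(max_queries):
        if not detect_watermark(candidate):
            return candidate     # the attacker keeps this version
        candidate = paraphrase(candidate, rng)
    return None                  # needs more queries or a better rewriter
```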

Can There Be Silent Failure?

A core tenet of cybersecurity design is “no silent failure.” If a defense mechanism is breached, you should be able to tell it has been compromised. This poses a challenge for watermarks. If you want to verify a watermark from another party, you must ask that party (or its delegate) to confirm the watermark. But what if the party you are checking with has been compromised or is itself corrupt? 

Public-key cryptosystems suggest a possible approach to deal with this issue. In these systems, a digital certificate is used to verify that a key was validly issued (that is, its validity can be traced to the trust anchor of the particular public-key cryptosystem). But to protect both parties at the time of a transaction, the certificate must be status-checked immediately before it is accepted as validation for a given key. Protocols and centralized services for handling verification are essential here. (LLM watermarks would benefit from lessons learned by cyber experts who designed and deployed security services such as the Online Certificate Status Protocol.)
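
What such a status check might look like for watermarks, loosely modeled on OCSP practice, is sketched below. The endpoint, request format, and response fields are entirely hypothetical; no such watermark-status service exists today.

```python
# Hypothetical watermark status check, modeled loosely on OCSP practice:
# before relying on a verified watermark, ask the issuing authority (or its
# delegate) whether that watermark, and the algorithm behind it, is still
# considered good. The service URL and response format are invented here.
import json
import urllib.request

STATUS_URL = "https://status.example-watermark-authority.org/v1/check"  # hypothetical

def watermark_status(watermark_id: str, timeout: float = 5.0) -> str:
    """Return 'good', 'revoked', or 'unknown' for a given watermark ID."""
    payload = json.dumps({"watermark_id": watermark_id}).encode()
    req = urllib.request.Request(
        STATUS_URL, data=payload, headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            answer = json.load(resp)
        return answer.get("status", "unknown")
    except (OSError, ValueError):
        # No silent failure: an unreachable or malformed authority response
        # is reported as "unknown", never quietly treated as "good".
        return "unknown"

def accept_document(watermark_id: str, verified_locally: bool) -> bool:
    """Accept only if the local check passed AND the authority says 'good'."""
    return verified_locally and watermark_status(watermark_id) == "good"
```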

Replacing a compromised encryption key is one thing; replacing a compromised encryption algorithm is another. In cybersecurity, “algorithm agility” is a system characteristic such that when a cryptographic algorithm is found deficient, that algorithm can be swapped quickly with another. Implementation of such a capability requires centralization of control and automation of deployment. 

This dynamic could also apply to digital watermarks. Agility requires careful design as well as implementation of a fallback option. For cybersecurity, on-demand replacement of code for Algorithm-A with code for Algorithm-B on potentially many platforms during a live response to an incident is not operationally feasible in every case. A better practice is to predeploy both and then run both algorithms at random, so they are regularly exercised and known to work. Then, in the event of a compromise, the affected algorithm can be disabled and the uncompromised one can promptly take over with minimal impact on the overall ecosystem. Whether such an analysis applies usefully to the compromise of a watermarking algorithm needs further study. But this type of “hot swappable” system can require significant financial costs, which brings us to another cautionary note.
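
A sketch of that predeploy-both pattern applied to watermarking: two placeholder algorithms are registered and exercised at random, and disabling a compromised one is a configuration change rather than an emergency code rollout. The algorithm names and embedding strategies are invented for illustration.

```python
# Illustrative "algorithm agility" for watermark embedding: two independent
# algorithms are predeployed and selected at random so both stay exercised;
# disabling a compromised one is a configuration change, not a live code
# replacement. algo_a/algo_b are placeholders for real schemes.
import random

def algo_a(text: str) -> str:
    return text + "\u200b"      # placeholder embedding strategy A

def algo_b(text: str) -> str:
    return text + "\u200c"      # placeholder embedding strategy B

REGISTRY = {
    "wm-alg-A": {"embed": algo_a, "enabled": True},
    "wm-alg-B": {"embed": algo_b, "enabled": True},
}

def embed_with_agility(text: str, rng: random.Random | None = None) -> tuple[str, str]:
    """Pick one enabled algorithm at random; return (marked_text, algorithm_id)."""
    rng = rng or random.Random()
    enabled = [name for name, entry in REGISTRY.items() if entry["enabled"]]
    if not enabled:
        raise RuntimeError("no uncompromised watermarking algorithm available")
    choice = rng.choice(enabled)
    return REGISTRY[choice]["embed"](text), choice

def disable(algorithm_id: str) -> None:
    """Respond to a compromise: stop issuing with this algorithm immediately."""
    REGISTRY[algorithm_id]["enabled"] = False

marked, used = embed_with_agility("LLM-generated draft ...")
disable("wm-alg-A")             # e.g., after the secret behind A leaks
marked2, used2 = embed_with_agility("another draft ...")
assert used2 == "wm-alg-B"      # traffic promptly shifts to the surviving algorithm
```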

How Expensive Could Watermarks Be?

The answer is, we don’t know yet, but there’s certainly no “free lunch” when it comes to digital watermarks. Three researchers in India in 2021 found that nesting—embedding multiple watermarks inside one another—could help make the approach more resilient, but unpacking nested watermarks for verification could end up being computationally more expensive than generating them. The cybersecurity world has repeatedly shown that protection is often asymmetric—meaning the cost to defeat a defensive measure is less than the cost of implementing it. This principle plays out here as well.

Like many cybersecurity toolsets, watermarks also require a supporting ecosystem, and those costs must be borne by content consumers as well as by content producers and providers. Before we can calculate these costs, we need to understand who the stakeholders might be, bringing us back to the need for a crisp problem statement.  

Who Has the Authority? 

There are plenty of questions that need to be answered here. For example, will there be only one watermark authority, something akin to how the Internet Corporation for Assigned Names and Numbers (ICANN) manages the internet domain-name system? Or is it better to have multiple authorities representing different content producers, providers, and even countries? Whether one or many, watermarks have no value without an issuing authority. We already have competing LLMs from companies such as Meta, Google, Hugging Face, Microsoft, and OpenAI. Will each of them want to stand up its own watermark authority in the absence of a trusted centralized body such as an ICANN equivalent? If they all decide to do so, who will assume the cost of integrating watermarks into their operations? Content providers will certainly have something to say about this. Whether one or many watermark authorities are stood up, what guarantees can be made and what legal structure can be established for failures of the type associated with similar protocols, such as the Online Certificate Status Protocol, mentioned earlier? What liability, if any, does a compromised watermark imply? Can this be more like a pan-internet standard, or will it be country-by-country law?

Is There an Alternative?

As we noted earlier in this article, we are sympathetic to the challenges of regulating fast-changing technologies and can see why extending watermarks to LLMs is intuitively appealing. But cybersecurity history implies that LLM watermarks could be circumventable, unacceptably burdensome, or both. Moreover, the prospect that they may become obsolete within the useful lifetime of the otherwise protected text can’t be ignored, much as rapid advances in quantum computing threaten to break existing cryptographic standards. Advances in quantum-based image steganography could conceivably extend to text, but this remains highly speculative.

A simple, robust, and inexpensive method for verifying the provenance and integrity of LLM-created content is, at this time, an aspiration. The lack of an airtight test for whether content was created by an LLM is naturally a significant concern for the national security community, which must grapple with issues such as LLMs’ ability to generate unlimited quantities of convincing falsehoods. An airtight test for LLM authorship would be extraordinarily meaningful, but entities that have no desire to watermark the output of their LLMs, or that choose not to disclose the watermarks they do embed, cannot magically be made part of any watermarking regime against their will.

The makers of the most widely used base models may well be subject to sufficient regulatory pressure to play along in due course, but not so the open-source models that exist in no jurisdiction. Nation-states that wish to forge watermarks, so that their own outputs masquerade as legitimate outputs of their adversaries, will undoubtedly expend much effort to do so, just as they expend much effort to subvert their enemies’ cryptographic systems. In the absence of an airtight test of LLM authorship, there will be growing demand for alternative approaches, such as multi-factor functions that characterize the likelihood of AI versus human provenance as a probability rather than a definite proof.
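
One way to picture such a multi-factor estimator is a handful of weak signals combined into a single probability by a logistic function. The signal names, weights, and bias below are invented for illustration; as the next paragraph asks, calibrating real weights requires ground truth that rarely exists.

```python
# Hypothetical multi-factor provenance estimator: several weak, imperfect
# signals are combined into a probability of machine authorship rather than
# a yes/no proof. The features, weights, and bias are illustrative only;
# calibrating them requires labeled ground truth that rarely exists.
import math

def ai_provenance_probability(signals: dict[str, float]) -> float:
    weights = {                         # invented weights, not calibrated values
        "watermark_detector_score": 2.5,
        "perplexity_anomaly": 1.2,
        "stylometric_distance": 0.8,
        "metadata_consistency": -1.0,   # consistent human metadata lowers the score
    }
    bias = -1.5
    z = bias + sum(weights[name] * signals.get(name, 0.0) for name in weights)
    return 1.0 / (1.0 + math.exp(-z))   # logistic squash into (0, 1)

estimate = ai_provenance_probability({
    "watermark_detector_score": 0.7,
    "perplexity_anomaly": 0.4,
    "stylometric_distance": 0.3,
    "metadata_consistency": 0.9,
})
print(f"estimated probability of machine authorship: {estimate:.2f}")
```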

Yet what sort of decision support does a probability of machine authorship actually provide? How do you calibrate probabilistic estimators when there is no ground truth, no way of knowing which texts, or portions of texts, were actually composed by an LLM but not acknowledged as such? This approach may make sense in sectors such as education, where the stakes are somewhat lower and the threat model is that of LLM-generated content misrepresented as human-generated content. But appreciable fallibility cannot be accepted in most national security settings; hence, the search for a better way to verify provenance and authenticity must continue. This is where innovative startups as well as existing tech companies have a role to play.

The parallels between GenAI and watermarks on the one hand and experience in the cyber domain on the other make cybersecurity the best field from which to borrow ideas and adapt methods for LLM purposes. These lessons include, among others, recognizing the urgent need for an explicit, use-case-based, bounded threat model in this field, and understanding why watermarks should not be considered an airtight method for exposing content as LLM-generated. Cybersecurity’s history teaches us to be circumspect; the domain has progressed in part by learning lessons from decades of incidents, yet cyber practitioners are still finding new things they didn’t know they didn’t know.

As in cybersecurity, with watermarks we face sentient opponents, and so we expect that digital watermarking will fail to meet its design aims, at least for a period of time. Even if we learn useful things from those failures, great care must be taken not to prematurely expand application of the technology from specific to fully general use cases.

While good-faith measures like the NDAA’s crowd-sourced competition are welcome, they are in no way sufficient. We also must understand the underlying incentives, disincentives, and politics that drive outcomes. Good engineering of outcomes begins with well-defined problem statements and a frank assessment of failure modes. We are not there yet. Learning from cybersecurity may be our best strategy for making progress.


Bob Gleichauf is a Senior Fellow at IQT. Prior to that he worked at a variety of large and small tech companies.
Dan Geer is a Senior Fellow at IQT.
