Worried About AI Monopoly? Embrace Copyright’s Limits
Published by The Lawfare Institute
Big Tech is in the crosshairs. Critiques reverberate on both sides of the aisle. Antitrust lawsuits abound. Privacy problems persist. And now, a new front has opened: copyright—specifically, the claim that Big Tech’s generative artificial intelligence (AI) infringes copyrighted works. Some observers have even applauded copyright enforcement and expansion as a vehicle to weaken Big Tech’s monopoly power.
Attempts to use copyright to pursue such “antimonopoly” goals, however, need to keep in mind copyright’s competition framework. Copyright is built not just on rights that benefit authors but also on limits on those rights that benefit everyone. Focusing solely on the former not only is ineffective in attaining these goals (given the size of the largest copyright owners) but also forgoes copyright’s competition-promoting tools for fostering a robust AI marketplace.
Copyright’s Limits
Copyright protects original works of authorship—such as books, artwork, music, and movies—that are fixed in a tangible form of expression. Like other forms of intellectual property, copyright serves an instrumental economic purpose. It aims to benefit the public by supporting the creation of works that would not otherwise be economically viable, while simultaneously providing public access to those works. For example, if anyone can copy and distribute a book freely, then the price an author could charge would be the marginal cost of copying and distribution, which, for digital files, is close to zero. By providing creators of expressive works with certain exclusive rights—including the rights to copy and distribute their works—copyright allows the author to charge higher prices and, in that way, provides an incentive for them to write the book in the first place.
But the other side of the equation—limits on copyright that protect the public—also is critical. For starters, copyright protects only expression, not ideas, facts, processes, and other information. George Lucas could copyright the specific work that is “Star Wars,” but not the idea of the hero’s journey or the concept of a fictional war set in space. Similarly, an author can copyright a specific biography of George Washington but cannot copyright the underlying facts that make up his life. As the U.S. Supreme Court has explained, copyright safeguards society’s “interest in the free flow of ideas, information, and commerce.” In addition, a robust “fair use” defense—representing “perhaps the ultimate” copyright doctrine “serving pro-competitive interests”—allows uses of copyrighted works in multiple settings.
In addition to U.S. law, copyright systems around the world also incorporate limits. For instance, the Trade-Related Aspects of Intellectual Property Rights (TRIPS) Agreement requires all 166 World Trade Organization member countries to adopt the “idea-expression” dichotomy, and national copyright systems have incorporated it through statute or judicial interpretations of domestic law. Even though not all countries have flexible, open-ended fair use standards like the United States, all countries have limits related to accessibility, research, education, and similar uses, and countries are increasingly incorporating specific limits related to text and data mining, including AI training.
Limits are essential in enabling socially beneficial uses that rightsholders might inhibit and in reducing transaction costs that might impede uses. Limits are particularly important because control over copyrighted works can serve as a bottleneck to new creativity and technological development. Prior generations of economic thinking contemplated rightsholders possessing perfect information and a frictionless ability to transact. This thinking anticipated that as long as copyright holders were granted broad, clear entitlements, markets would take care of the rest. After multiple decades and dozens of economic studies, however, it is clear that the market structure most conducive to innovation is far from settled.
History demonstrates the importance of copyright’s limits in addressing bottlenecks. More than 20 years ago, legal scholar Tim Wu noted that “[a]s the pace of technological change accelerates, copyright’s role in setting the conditions for competition is quickly becoming more important, even challenging for primacy the significance of copyright’s encouragement of authorship.” Wu explained that “it is essential that judges, lawmakers, and academics understand the effects of the law on parties other than authors.” With the rise of digital technologies, which must make copies to function, copyright has a greater effect than ever.
To be sure, copyright has not mediated competition between incumbents and upstarts in a purely uniform manner. Over time, however, copyright has evolved in ways that support challengers and avoid foreclosure by incumbents. U.S. courts, for example, refused to let copyright block cable TV from challenging traditional broadcasters. The Supreme Court treated cable systems that carried existing broadcast signals to viewers who could not otherwise receive them as enhancing viewers’ access rather than engaging in free riding that undermined copyright. The Court also found that VCRs were capable of substantial noninfringing uses and thus rejected the movie and TV industry’s argument that manufacturers should be held responsible for potentially infringing uses. In both cases, copyright holders threatened to impose bottlenecks based on their exclusive rights that would have blocked market entry.
Most generally, U.S. courts have found fair use where copying implicates uncopyrightable elements and uses of works that do not communicate the protectable, expressive elements to an audience. One can study a copyrighted book to glean uncopyrightable facts and produce a distinct work as long as that work is not substantially similar to the protectable, creative expression in the book studied. By the same token, in cases involving (for example) search engines, book digitization, plagiarism detection, and text and data analysis, courts have enabled productive new uses of existing copyright works, even where such uses involved copying at great scale as an intermediary step.
Just one example is provided by software company Connectix reverse engineering Sony’s video game system. By doing so, Connectix extracted uncopyrightable elements and made competing software that allowed PlayStation games to be played on personal computers. The court found that Connectix’s conduct constituted fair use. It acknowledged that “Sony understandably seeks control over the market for devices that play games Sony produces or licenses” but held that “copyright law ... does not confer such a monopoly.”
Copyright’s limits are particularly critical in addressing generative AI.
Application #1: AI Training
Developers of generative AI train their models on existing works. These works typically include hundreds of billions or even trillions of individual “tokens” of data. Maximizing the amount of data a model can train on is critical to reducing bias and improving the accuracy of predictions. Models derive ideas, facts, patterns, and other non-copyrightable insights from the sources they train on. Even though copies of copyrightable expression are created in the training process, the models themselves are not databases intended to retain copies of those expressions. While models may inadvertently “memorize” particular works from their training data (retaining, and being able to output, substantial portions of those works), developers can take steps to inhibit such “memorization,” as well as “regurgitation” of material to users of the model.
More than 50 copyright cases against generative AI developers are currently pending in U.S. courts, and governments around the world are evaluating whether and how to update their laws in light of generative AI. Rightsholders and allied advocates often claim that requiring copyright licenses for training data would support competitive, open markets. To the contrary, such licenses threaten to (1) create new barriers to market entry and competition in new technologies, reinforcing large incumbent tech companies’ advantages, while (2) entrenching large copyright holders and (3) not significantly benefiting creators.
First, requiring copyright licenses for the use of existing works in training would impose new costs (including on both developers of AI and creators who use AI), which means new barriers to entry. There is no official registry of copyrighted works and owners, and existing datasets can be incomplete or erroneous. Consider just content on the Web; copyright attaches to material automatically, but there is no central clearinghouse for licensing content from sites, and pages and links rot and go out of date. The challenge gets even thornier when one considers that the rights to every comment on Reddit or in a local newspaper are owned by the commenters themselves. The vast majority of in-copyright books are out of print, and most are not actively managed by their rightsholders. In fact, to the extent training may be infringement, it can be unclear whether the authors or publishers control that specific right for a given book because the answer is contingent on publishing contracts not devised with this use in mind. And despite repeated attempts to create central databases of rights information, the licensing of music is notoriously complex.
The companies best able to absorb all of these costs are the existing, well-resourced incumbents—in other words, Big Tech. Incumbents already have access to large volumes of data they can use. For example, Meta’s and Google’s standard terms of service give them effectively unfettered legal access to the user-generated content posted on their services for use in AI training. Given that YouTube, Facebook, and Instagram each have more than 1 billion users, this is a significant advantage. Startups, developers, researchers, and everyone else who lacks such access to data will, at best, face extraordinary licensing costs, and more likely an inability to ever match well-resourced incumbents’ access to data. Incumbents, in other words, will be able to pull up the ladder behind them, precluding others from enjoying similar success.
Second, media markets are highly concentrated, and licensing fees would predominantly accrue to large incumbents. Media companies such as Disney have vast stores of content from across different media. Think not only of every movie and TV program from Disney (including LucasFilm, Marvel, and more), but also pictures and books about specific characters, every ESPN article, radio show, and podcast, and so on. Universal Music Group owns the publishing rights to over 5 million songs, and Getty claims to have 625 million assets in its collection. What’s more, these companies collectively “commission millions of works every year from working creators” and may act as aggregators, allowing works to be licensed in bulk. In fact, based on their experience with text and data mining and licensing, some researchers increasingly fear that “the goal of the entertainment and wider content industry” likely is “to control the AI production cycle – from the inputs and the application of the model to their outputs” even though “[t]his is not what copyright was intended to do, but aligns with a wider trend to seek to control and/or moneti[z]e all citizens’ and corporations’ use of information, and in turn stretch control over all aspects of the digital economy.”
The harms extend across society. As a general-purpose technology, generative AI is an input into a wide range of activities. Copyright holders might ignore productive uses that matter to society but yield limited revenue—for example, models trained for scientific research purposes, to study AI itself, or to engage in biomedical research.
Large copyright holders also might not be able or willing to license innovations that compete with or disrupt their existing business models. Clayton Christensen’s “Innovator’s Dilemma” explains that leading companies pursue “sustaining” incremental innovations but often fail to introduce more radical “disruptive” innovations.
The Innovator’s Dilemma underlies the fears—similar to those voiced about generative AI—that the content industries have leveled against previous disruptive technologies. In 2011, intellectual property scholar Mark Lemley detailed a range of such technologies that included photography, player pianos, gramophones, radio, cable television, photocopiers, VCRs, audio cassettes, digital audio tapes, MP3 players, peer-to-peer technology, DVRs, digital radio, and digital television. Just to give one example, Jack Valenti, then the head of the Motion Picture Association of America, famously lamented in 1982 that “the VCR [videocassette recorder] is to the American film producer and the American public as the Boston strangler is to the woman home alone.”
Valenti and those who share his views have cast these disruptive technologies as an existential threat to creativity itself. But instead, these technologies have repeatedly opened up opportunities for new types of creators and art as well as for existing creators. They also have created new markets, lowered authorship costs, and shifted (as opposed to eradicated) value.
The VCR, for example, created the home video market and resulted in revenues even higher than the box office. Sites such as YouTube lowered barriers for distribution, paving the way for video creators to create and share works outside the traditional movie and TV system. Similarly—as Katharine Trendacosta and Cory Doctorow put it in the context of generative AI image generation—“[f]or every image that displaces a potential low-dollar commission for a working artist, there are countless more that don’t displace anyone’s living—images created by people expressing themselves or adding art to projects that would simply not have been illustrated.” Tying this into copyright law, Fred von Lohmann explains: “[B]y encouraging technologists to invest in innovations that reproduce copyrighted works, the fair use doctrine may ultimately benefit copyright owners themselves, at least to the extent new technologies prove to enhance the value of copyrighted works.”
Third, against that backdrop, one might ask, what about the individual artist? Any licensing reallocation between large tech and large media companies is unlikely to be material for most creators. Because each individual work is only a tiny portion of the necessary training data, individual creators will get, at most, only the slightest amount of revenue. For example, StabilityAI, one of the initial companies to invest in and commercialize text-to-image generator Stable Diffusion, was valued at roughly $1 billion at the time the model rose to prominence and was trained on more than 2 billion images. Even if all of that company’s value were liquidated and went directly to artists, without any middlemen, that is still just a one-time check of $0.50 per work.
Back-of-the-envelope math for other companies and types of models is no more encouraging. In fact, AI developer Anthropic recently settled a copyright lawsuit brought by authors for $1.5 billion, and, while that number is large in absolute terms, the authors’ share is relatively small. Publishers and the authors’ legal representation stand to receive more than 60 percent of that total, and authors who have rights to covered books in the lawsuit will generally get only $1,500 per work.
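The back-of-the-envelope numbers in the last two paragraphs can be reproduced with a few lines of arithmetic. This is a hedged sketch: the even-split assumption and the function name are illustrative only, and the inputs are the figures reported above.

```python
# Back-of-the-envelope arithmetic using only the figures reported above.
# The even-split assumption is illustrative, not a real payout model.

def per_work_payout(total_dollars: float, num_works: float) -> float:
    """Naive per-work share if a pool of money were split evenly across works."""
    return total_dollars / num_works

# Stability AI: ~$1 billion valuation; Stable Diffusion trained on >2 billion images.
stability_per_image = per_work_payout(1_000_000_000, 2_000_000_000)
print(f"${stability_per_image:.2f} per image")  # prints "$0.50 per image"

# Anthropic settlement: $1.5 billion total, with publishers and legal
# representation reportedly taking more than 60 percent of it.
authors_pool_upper_bound = 1_500_000_000 * 0.40
print(f"${authors_pool_upper_bound:,.0f} left for authors, at most")
```

Even under the most generous assumption (no middlemen, the entire pool paid out), the per-work figures stay small, which is the point of the comparison.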
As Xiyin Tang has shown, the ownership of copyrights is not “a diffuse web of interests spread out amongst individual creators, artists, and authors” but, instead, is “concentrated in a handful of large corporations.” Tang notes that “[c]hanges in copyright laws that increase the cost of content licenses ... will only enrich the long-standing dominance of traditional content licensors,” while also “entrench[ing] and concentrat[ing] licensees, creating a bilateral oligopolistic market for copyrighted works.” She concludes that, “[t]o fight monopoly, to be truly neo-Brandeisian, one must think beyond copyright law.”
Application #2: “Shadow Libraries”
Much of the debate about AI training has focused on companies crawling public websites to collect training data. In addition to website text, developers may use other types of media, including books found online or digitized from offline copies. For example, in 2015, researchers created BookCorpus, a “corpus of 11,038 books from the web,” compiled by copying books from an independent book distribution site called Smashwords. Another example is the Books3 dataset, which was developed by researchers and includes 170,000 books downloaded from sources not authorized to distribute all of the works contained in the dataset. In other words, the books were originally downloaded from sites dedicated to piracy, sometimes called “shadow libraries.” Datasets such as these, and many others containing unlicensed copyrighted works, were bedrocks of AI research and advancements, as increases in the size of datasets led to improvements in AI models.
As the use of such datasets grew beyond research settings, they have become another, distinct point of contention. Rightsholders have contended that many developers have unlawfully relied on pirated copies of books and thus their model development is infringing. Developers as well as other copyright scholars have responded that their acquisition of the works is merely an intermediate step to a lawful act—training a model—and the books are not otherwise used for reading or other purposes that interfere with rightsholders’ legitimate interests.
So far, two courts have addressed this issue. While both ruled that the developers’ use of the books for training was fair use, they reached different conclusions on how the developers used books. One ruling found no issue with the use of “shadow libraries,” but the other found that downloading and keeping a “permanent library” of these books was unlawful and questioned whether use of these works could ever be legal.
While the use of pirated works may raise copyright issues distinct from those posed by information posted publicly on the Web, the way copyright regulates in this setting has clear impacts on competition. Prohibiting this use would reinforce market power since every developer would need to independently scan millions of books as a condition of market entry. In particular, Google would have a huge competitive advantage, as it has already invested hundreds of millions of dollars in scanning 40 million books. Only the largest tech companies could afford to duplicate this effort. And even that assumes that research libraries, which have already received scans from Google, would cooperate with other companies.
Here, too, the potential benefit for authors is relatively limited. If developers cannot use these shadow libraries, then they would scan second-hand books or buy a single copy of a new book, thereby yielding little if any revenue for rightsholders.
Application #3: “News Summaries”
In addition to challenging the training of generative AI models, rightsholders are also suing developers with respect to “inference,” which is what happens when a user enters a prompt into a model or otherwise gives it material to analyze to produce an output. The operators of generative AI tools may copy and use material as part of this process. For example, when someone inputs into an AI tool such as Perplexity a prompt that reads something along the lines of “what’s going on with tariffs,” Perplexity will fetch relevant pages from the Web related to that query, use its AI model to analyze those pages, and then produce a response. Similarly, visitors to Google now frequently see “AI Overviews,” which provide natural language answers to queries, as opposed to a list of links to third-party sites.
News publishers have alleged that this use is damaging to them and infringes copyrighted works. They say these summaries effectively substitute for the content on their website. Users, satisfied with the answer they have received from the AI tool, may have less reason to click through to the publisher’s site (if a link is even provided), and thus publishers lose opportunities to make money from their works. Other publishers of facts, such as recipe websites and Encyclopedia Britannica, have raised similar concerns.
Developers, in contrast, argue that while they need to make a copy of a page to analyze it, the outputs are non-infringing. Rather than copying the copyrightable expression, the outputs merely reuse facts, ideas, or other uncopyrightable material.
This debate matters from an antimonopoly perspective. What news publishers are asking for is effectively a double standard: AI developers would not be allowed to do to them what they routinely do to other newspapers and to other third-party sources like books they cite in a story or review. News publishers often use facts copied from third-party sources, including competing news sources. One of countless examples involves remarks by Nick Clegg, the former U.K. deputy prime minister and high-ranking Meta official. The Verge published an article titled “Nick Clegg says asking artists for permission would ‘kill’ the AI industry,” which incorporated and summarized facts from a similar article in The Times. While such use of third-party material competes with and may substitute for the original, it is not infringing unless the articles include substantially similar, copyrightable expression from the original article and are not covered by copyright’s limits.
More generally, it is worth considering what it would mean to create property rights in facts and how that would affect competition. In essence, it would allow anyone who found and first wrote down a fact—about the news, a recipe, or the weather—the ability to act as a bottleneck blocking anyone who wished to use that fact. While that may benefit incumbent news publishers (or recipe sites or encyclopedias), it would hinder competition and innovation. The European Union offers a cautionary note, as its implementation of special protection for databases has not had the promised pro-competitive effect of incentivizing investment in new databases.
***
Copyright’s limits play essential antimonopoly functions. They can and should continue to do so in the context of AI.
These limits, of course, are not by themselves sufficient to fully achieve antimonopoly outcomes. Access to data is only one of many considerations when it comes to possible antimonopoly approaches to governing AI. Just as copyright cannot solve many policy problems, copyright’s limits also are not all-powerful, and incumbents may be able to co-opt disruptive innovation. For that reason, it is important to look outside copyright to find more tailored measures. To offer two examples, copyright-related conduct could represent illegal antitrust tying of products, and labor law could be more effective than copyright law in supporting creators.
These non-copyright policies can address concentrated power without triggering the monopoly-promoting, competition-harming effects of expansive copyright. The wave of copyright lawsuits against generative AI companies threatens to entrench large tech and large media companies with little benefit to creators, let alone the public at large. Allowing text and data analysis, including training and use of AI models, offers a powerful tool against this consolidation. Many of the lawsuits are against new entrants. As even proponents concede, attempts to limit fair use based on “stature”—such as treating Big Tech differently from startups—can easily backfire. In short, anyone wishing to effectively attain antimonopoly outcomes should embrace, not attack, copyright limits.
