
In the AI Race, Copyright Is the United States’s Greatest Hurdle

Tim Hwang, Joshua Levine
Monday, June 30, 2025, 8:30 AM

Domestic battles over copyright will define whether the U.S. emerges as the definitive leader in the technological race with China.

The flags of the U.S. and China. (U.S. Department of Agriculture, https://tinyurl.com/mv359ee9; Public Domain, https://creativecommons.org/publicdomain/mark/1.0/)


It is no secret that dominance over cutting-edge technologies will play a major role in the geopolitical competition between the U.S. and China. Technological leadership will help define not just the economic health of each nation but its military and soft power assets as well. Artificial intelligence (AI) has emerged as one pivotal area in this competition, with both nations working to accelerate their capabilities in the technology and secure the inputs necessary for further development. China’s national and provincial governments are taking steps to create the infrastructure and regulatory regime to empower AI development and diffusion.

The United States is currently ahead in this race: Its companies are making the most dramatic breakthroughs in the technology and are implementing AI at scale throughout the economy. But this lead is fragile. Leading American AI labs face an existential threat: copyright lawsuits. In these suits, rights holders argue that training models on copyrighted material scraped from the web without their express consent violates copyright law. Because of the vast amount of data included in such training sets, the potential copyright penalties would bankrupt many AI developers.

Though little discussed in debates over geopolitical competition, it may ultimately be these domestic battles over copyright that determine whether the U.S. emerges as the definitive leader in the technological race with China or falls behind. To remedy this issue and ensure the U.S. stays ahead, Congress, or ideally the courts, should take the bold and important step of affirming the legality of using publicly available data for training AI models in the United States.

China’s Data Directives

China is well-poised to catch up to the U.S. in AI and in some respects may already be ahead. One factor of AI production the Chinese government is working to support is access to high-quality, machine-readable datasets. The production function for AI includes talent, compute, and data. While all three inputs face bottlenecks, the scarcity of high-quality data is increasingly relevant. According to the AI research organization Epoch AI, if current trends in data usage for model training continue, developers will exhaust the entire stock of human-generated public text sometime between 2026 and 2032. The size of datasets used to train frontier large language models (LLMs) is doubling approximately every seven months. Vast amounts of data are a prerequisite for pre-training, a necessary early step in AI model development, as well as for fine-tuning, a later process that requires more targeted or specialized datasets to impart specific knowledge or modalities.
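To see why this timeline is so tight, consider a back-of-envelope sketch in Python. The seven-month doubling rate comes from the figure cited above; the current training-set size and the total stock of public text are illustrative assumptions, not figures from Epoch AI.

```python
# Back-of-envelope sketch: how quickly a seven-month doubling rate
# exhausts a fixed stock of public text. Both token counts below are
# illustrative assumptions, not figures from the article.

DOUBLING_MONTHS = 7       # cited above: dataset size doubles ~every 7 months
train_tokens = 15e12      # assumed tokens in a current frontier training run
stock_tokens = 300e12     # assumed total stock of human-generated public text

months = 0
while train_tokens < stock_tokens:
    train_tokens *= 2     # one doubling period elapses
    months += DOUBLING_MONTHS

print(f"Training runs reach the assumed stock in ~{months} months "
      f"(~{months / 12:.1f} years).")  # -> ~35 months (~2.9 years)
```

Under these assumed magnitudes, the stock runs out in roughly three years, consistent with the 2026-2032 window projected by Epoch AI.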

The Chinese government has adopted a two-pronged approach to ensure that domestic model developers, particularly those partnering with state-supported industries, research and development (R&D) organizations, and academic institutions, have the data necessary not just to compete with U.S. labs but to best them.

The first prong consists of updated laws and regulations related to acquiring and using data for model training. In June 2023, the Cyberspace Administration of China (CAC) issued new guidelines for AI service providers, the entities that train and provide models. These guidelines restrict the types of data that model developers can include in the datasets used for pre-training and fine-tuning. The restrictions most often require respecting intellectual property, such as copyrights, as well as filtering out data sources that would undermine “core socialist values.” The guidelines, however, do not apply to models that are not available to the public in China, or that are being used to support industrial activities, R&D, academic research, and other tasks that would bolster Chinese techno-industrial capacity.

DeepSeek offers a good case study in how these laws are being applied. Upon its release, many observers noted that the DeepSeek application and web-based model refused to answer queries about Tiananmen Square or to criticize the Chinese regime, and that it collected keystroke and query data that was then stored within mainland China. Such practices are standard fare for Chinese technology companies, as illustrated by the data storage practices of Chinese-owned platforms such as TikTok, Temu, and Shein. But DeepSeek’s VL model paper cites its use of Anna’s Archive, a “shadow library” (an online repository of freely available books, including pirated works), as a source for the Chinese- and English-language texts used in its training data. Recent data shows that 30.71 percent of DeepSeek’s monthly active users are in China. DeepSeek’s use of Anna’s Archive and other open corpora for training data, without express consent from the content owners, would technically violate China’s laws governing data access for the development of AI models. Such facts, coupled with DeepSeek’s self-censorship, demonstrate that the Chinese Communist Party (CCP) is more concerned about model outputs violating its speech codes than about respecting copyright, particularly when it comes to training data.

The second prong of China’s approach deals with the creation and availability of training data. The Chinese state is addressing this problem in a few ways. The National Bureau of Statistics, along with other state organs, issued a document outlining a plan to make datasets available to developers across the country and to create “data exchanges” that increase access and ease the portability of data throughout the Chinese AI ecosystem. This complements actions by local governments in Beijing, Shanghai, and Shenzhen, among others, to collect public data and make it available to model developers and businesses to accelerate commercial and research applications of AI. In October 2024, the Central Committee of the Communist Party and the State Council declared their intention to lift barriers to the availability and use of public data to further promote the burgeoning “data ecosystem” and make such data available to enterprises and developers within China. According to the committee, this will ensure that Chinese firms have access to the necessary inputs for large, broad-based foundation models, as well as for fine-tuning models for narrow applications such as advanced manufacturing, robotics, dynamic traffic management, and other techno-industrial activities.

Beyond setting up physical and digital locations to house and move data, the Chinese government is working to ensure that such data is actually usable. AI models are trained on both structured and unstructured data: The former follows strict schema and organization, while the latter is more diverse and disorganized, depending on the algorithmic methods being used and the type of model being built. According to the guidelines, “new measures will be implemented to provide professional education and training, to improve the professional skill levels of data annotators, and to establish a comprehensive talent pool for the new industry.” These annotators will improve the quality of training data and encourage firms to rely on such intermediaries for a key input to AI model development. This effort, combined with those noted above, illustrates how the Chinese government is leveraging legal edicts and state power to ensure that domestic firms have the inputs necessary to advance AI development.

Access to publicly available, uncopyrighted data for training and developing LLMs in the United States pales in comparison. While data.gov makes thousands of datasets available, and the Department of Commerce has put out guidance on leveraging open data for generative AI, each federal agency has different data access policies, which affect the structure and usability of such data for model training. Agency and department policies can vary significantly with regard to how data is made accessible, whether through APIs, through individual file formats such as spreadsheets, or in physical form. Further, a vast majority of government data is housed in banker’s boxes rather than on cloud servers. Because such data is neither readily nor consistently available, internal government development of AI tools is hampered, and third parties and the private sector are limited in how they can leverage public data to support new AI products and services.

The Potential Peril of Copyright Lawsuits

Copyright law defines the rights that authors of creative works have over reproductions of their work. By default, any copy made without permission is an infringement of copyright. Under the Copyright Act, statutory damages range from $750 to $30,000 per work but can be increased to up to $150,000 per work if the infringement was willful.
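The arithmetic of these statutory ranges explains why exposure balloons at training-set scale. The sketch below multiplies the per-work figures cited above by a hypothetical count of registered works in a web-scale corpus; the one-million-work figure is an assumption for illustration, not a number from any pending case.

```python
# Illustrative sketch: statutory damages exposure as a function of how
# many registered works appear in a training corpus. Per-work figures
# are from the Copyright Act as cited above; the work count is assumed.

STATUTORY_MIN = 750        # minimum statutory damages per infringed work
STATUTORY_MAX = 30_000     # ordinary statutory maximum per work
WILLFUL_MAX = 150_000      # enhanced maximum for willful infringement

def exposure(num_works: int) -> dict[str, int]:
    """Dollar exposure at the statutory floor, ceiling, and willful ceiling."""
    return {
        "minimum": num_works * STATUTORY_MIN,
        "maximum": num_works * STATUTORY_MAX,
        "willful": num_works * WILLFUL_MAX,
    }

# Assumption: 1 million registered works in a web-scale training set.
for label, dollars in exposure(1_000_000).items():
    print(f"{label:>8}: ${dollars:,}")
# ->  minimum: $750,000,000
#     maximum: $30,000,000,000
#     willful: $150,000,000,000
```

Even at the statutory floor, a corpus containing a million registered works implies three-quarters of a billion dollars in liability, which is the sense in which these suits are existential for AI developers.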

Industrial-era copyright rules sit uneasily with the modern requirements of AI models, which learn by extracting patterns from vast amounts of data. Products such as ChatGPT or Claude simply cannot be created without access to these datasets, which are pulled both from the open web and from proprietary sources. The problem is, of course, that this training process copies, without permission, content created by others.

To date, there are more than 40 lawsuits against frontier American AI labs. Media outlets such as the New York Times and The Intercept have sued OpenAI, alleging mass copyright infringement. Developers have sued GitHub and Microsoft for using their code to train systems that can generate computer software automatically. The Recording Industry Association of America has sued companies such as Suno and Udio for using its members’ recordings to train music generation models.

At the heart of these disputes is a legal ambiguity about whether the copying that takes place during the training of AI systems should be considered a fair use, a doctrine under U.S. law that balances the interests of rights holders against the public interest by protecting certain forms of copying. Whether a given activity is a fair use is governed by a four-factor analysis that examines the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion copied, and the effect of the use on the market for the copied work.

Courts have not always applied this doctrine consistently or articulated clear rules, particularly in highly technical, fast-moving domains like artificial intelligence. It is precisely this legal uncertainty that has allowed rights holders to credibly threaten to shut down a core input to progress if they do not receive their licensing fees. The present wave of litigation shows that rights holders are willing to use their leverage to the utmost, even if it stifles the next generation of U.S. leadership in the global competition for technological advantage.

Two cases that have helped shape the application of copyright law and the fair use doctrine with regard to digital technology are Google v. Oracle and Authors Guild v. Google.

The former case covered Google’s use of declaring code (code that names and organizes software tasks so that programmers can invoke the prewritten implementing code that carries them out) from a copyrighted software platform to enable developers to create new software and applications for its Android ecosystem. Oracle sued, claiming this constituted an infringement of its copyright. The Supreme Court sided with Google, recognizing that the code Google copied was “functional in nature” and bound up with uncopyrightable elements of the Java language. Further, the court found that the purpose of copying the code was not merely to support Google’s own platform but “to permit programmers to make use of their knowledge and experience.” To put this in the context of training generative AI models: Developers scrape data from the web to build training sets that help models learn the functional, nonexpressive elements of information and language, which enables the models to support human creation and development.

The latter case involved the creation of Google Books, a search engine built on millions of copyrighted books that could identify specific titles to support research and knowledge exploration. Google and its library partners were sued by the Authors Guild for scanning the books to create the tool. The Second Circuit held that Google’s copying and the tool it created were indeed a fair use: The tool’s ability to search and surface information about relevant books was decidedly transformative, because users could not read books in their entirety but could use the search engine to support research into a specific genre or author, furthering the production of knowledge. Contextualized to generative AI, users are not going to use an LLM to access a full-length edition of “Harry Potter,” but they may use it to think critically about the book’s plot or to identify common themes to help with a book report.

Of the more than 40 ongoing cases, the three most notable are New York Times v. Microsoft, Kadrey v. Meta, and Bartz v. Anthropic, PBC. The plaintiffs in all three cases are seeking billions of dollars in damages from AI labs over the use of copyrighted material in training datasets. The model developers claim that the use of copyrighted works in training data should constitute a fair use for many of the same reasons described in the cases above: The models learn from the functional, nonexpressive features of text and computer code in order to perform tasks and provide assistance with research, coding, and content creation. The developers point to the public benefits these models can deliver, as well as to the transformative nature of their use of the works. In their view, AI developers are not creating plagiarism machines, but new tools that can assist in the pursuit of knowledge and novel expression.

Given the scale of the data involved, plaintiffs in these cases are able to allege billions of dollars in damages despite, by and large, not having experienced any concrete financial harm. If the courts side with the plaintiffs, the end result will be a legal regime wherein AI can be produced only upon paying a toll to license data from every rights holder whose content may be included in a training dataset.

This would be a practical impossibility for even the largest and best-resourced companies in the space. As researchers and experts have highlighted, creating such a licensing regime faces several hurdles, including quantifying the contribution of each individual source within the training dataset, convincing a diverse set of creators to accept one market-clearing price for each use of their work, and creating a centralized licensing authority. One-off deals between individual media organizations and frontier labs, as well as new firms offering technical solutions that enable artists to individually monetize their work, show that there are mechanisms to compensate individuals that do not require a one-size-fits-all licensing regime. Rather than spending valuable time and capital arguing over royalty rates and paying rents to incumbents, leading labs should be focused on making the technological breakthroughs that preserve America’s AI edge.

Maintaining a balance between copyright protection and the development and diffusion of new technologies has been critical to America’s competitive edge in emerging technologies. The explosion of innovation that powered the rise of American technological leadership during the 2000s was accompanied by a similar wave of copyright lawsuits. In Perfect 10 v. Amazon, rights holders claimed that the caching of thumbnail images necessary to make image search possible constituted massive copyright infringement. And in Authors Guild v. Google, discussed above, publishers claimed that Google’s scanning of books to create an online searchable database was likewise infringing.

Both of these cases are key to understanding the relationship between training AI models and access to information. To develop platforms and tools that are usable and support new uses, access to raw information is invaluable. How such works should be valued is a question raised by plaintiffs in the current slate of cases against AI model developers, and it is reasonable for some to feel uncomfortable, or even outraged, that technology companies are using people’s works in ways they never contemplated. Film studios and some newspapers have grudgingly embraced the technology, opting to enter partnerships with model developers. Technology has certainly reshaped the economic landscape for creative endeavors, but on net it has grown the size and scope of the market for nearly all forms of media, even as it has introduced new winners and losers. As noted earlier, compulsory licensing requirements would create a subpar equilibrium for model developers and individual creators alike. As new technologies have changed markets for content consumption, they have been followed by new revenue models and forms of expression, as is beginning to happen with AI. Such experimentation and emergent order should be allowed to develop around generative AI and its hunger for data.

The digital tools at the center of the litigation described above were not created to misappropriate an individual’s work or words. Rather, they enabled new, transformative uses that expanded the potential market for the original works while also creating a host of new applications. Again, the Constitution’s original justification for copyright protection is to promote the progress of science and the useful arts, and applications such as general search engines and Google Books have done just that. They have made it easier for people to pursue new activities, acting as an engine for creative endeavors in science and the useful arts, much as AI models promise to be.

Even before search engines, fair use was a catalyst for innovation and market-making in traditional and novel mediums alike. Cases such as Sega v. Accolade and Sony Computer Entertainment Inc. v. Connectix Corp. demonstrate how fair use enables innovation, creates economic opportunities, and furthers the original purpose of copyright: promoting progress in science and the useful arts.

Beginning with Sega, the Ninth Circuit held that Accolade’s reverse engineering of the technical aspects of Sega’s system, undertaken to create games that were interoperable with the Sega console, was a fair use. Analyzing text and programming work to enable the creation of new text or code is a boon to technologists and creators alike. Accolade’s copying was not meant to rip off Sega’s existing games but to enable the creation of new games that were functionally compatible with the system, expanding the market for Sega’s console and for video games broadly. While AI models may enable people to create content that competes in the same market as a New York Times reporter or a software developer, it is the functional skills AI models are learning, the ability to write or to code, that will support new creative endeavors.

Similarly, in Sony Computer Entertainment Inc. v. Connectix Corp., the Ninth Circuit found that intermediate copying in the context of reverse engineering qualifies as a fair use, even when the entirety of the functional code is copied. The focus in that case was Sony’s BIOS program, which Connectix reverse-engineered to permit individuals who owned PlayStation games to play them on a computer in addition to Sony’s console. The functional elements of the BIOS program were not used to take advantage of Sony. Rather than ripping Sony off, Connectix expanded where and how PlayStation games could be enjoyed, supporting the incentive to create more games and continued growth in the market for Sony hardware and software.

The core debate in these cases was the same: Do uses of copyrighted material that facilitate transformative new technologies constitute a fair use? Luckily, courts in each case answered in the affirmative, allowing these new products and services to move forward. As a result, the enormous benefits of new technologies flowed to the public rather than being shackled to the veto of established interests.

The Role of Copyright in Great Power Competition

There is nothing sacrosanct about copyright, a body of law that emerged in the era of newsprint and phonograph records and has evolved alongside new technologies. No one could have imagined then that copyright rules would have implications for great power competition. Yet the narrow self-interest of a few industries in extracting a tax on a major innovation now threatens the viability of U.S. geopolitical competitiveness in a pivotal technology. This is not just a matter of commercial hardball, but a legal stratagem that threatens the United States’s broader national security.

These lawsuits are an existential threat to AI firms. The U.S. is the world leader in AI model development and is home to 75 percent of the world’s AI supercomputers, the infrastructure necessary to train cutting-edge models. But that comparative advantage will be wasted if copyright lawsuits obstruct access to an invaluable input for model development: data. AI innovators fleeing the U.S. for safer legal pastures would be a gift to the Chinese AI industry and its CCP supporters, as it would hamper existing and future R&D happening domestically. A few countries can already act as safe havens: Japan, Singapore, and Israel, among others, have taken steps to modernize their copyright laws to allow the use of copyrighted data to train AI models.

The government must act to protect U.S. leadership in AI by definitively putting this issue to rest and enshrining a protection for AI training within copyright law. There is opportunity across the government to act now: for the courts to take a clear stand in their rulings, for Congress to pass legislation clearly defining a standard, and for the White House to align its agencies in a pro-innovation direction. Technological competition moves fast, and the current state of affairs is resolving too slowly to head off the chilling effects these lawsuits are likely to have on the nascent AI industry. Failing to act may not just harm the U.S. economically but also threaten its access to a key technology in a time of intense global competition, when it will need every advantage it can get.

Tim Hwang is Substack’s general counsel and the author of the book “Subprime Attention Crisis: Advertising and the Time Bomb at the Heart of the Internet.”
Joshua Levine is a research fellow at the Foundation for American Innovation (FAI). His work focuses on policies that foster competition and interoperability in digital markets, online expression, and emerging technologies.
