Is there a world where artificial intelligence developers could get the training data they need through content licensing deals? Matthew Sag argues that licensing deals between AI developers and content owners are feasible only for large, concentrated rights holders and cannot plausibly extend to the bulk of the people who produce and own the content on the internet.
Editor’s Note: This article is part of a symposium that examines the competitive risks posed by artificial intelligence content licensing agreements—mechanisms that on the surface appear to balance copyright protection with innovation. How might these deals compel us to rethink the relationship between copyright, data protection, innovation, and competition in the age of AI? How can we seek to uphold all four, while recognizing the inevitable tradeoffs involved? You can read the articles from Mark Lemley and Jacob Noti-Victor (writing together), Matthew Sag, Christian Peukert, and Kristelia García here as they are published.
American courts are currently grappling with the question of whether training artificial intelligence models on data scraped from the open internet without express permission is fair use. The stark alternative is that it represents one of the most brazen acts of mass copyright infringement in living memory. How this question is resolved will determine whether, and to what extent, striking licensing deals for training data is necessary. Paradoxically, the feasibility of those same licensing arrangements may also factor into whether such AI training is fair use in the first place.
Since mid-2023, the companies working at the leading edge of AI development—OpenAI, Anthropic, Meta, and Google—have each entered into licensing agreements, collectively worth hundreds of millions of dollars, with major news organizations and other holders of large content archives. As a recent United States Copyright Office report summarizes, representatives of the content industries proclaim their eagerness to enter into licensing partnerships for AI training. To many, that interest, combined with the deals struck to date, shows that licensing training data is possible; it is then said to follow that companies developing AI should be required to license their training data. This article explains why these inferences are unsound. Licensing en masse will never be a realistic option for the development of large language models (LLMs).
The current AI training data licensing landscape
In July 2023, the Associated Press agreed to license, on undisclosed terms, a portion of its text news archive to OpenAI. This proved to be the first in a series of media deals struck by LLM developers, and of all the AI developers to date, OpenAI appears to have been the most active. Following the Associated Press deal, the maker of ChatGPT struck deals with Axel Springer (owner of Politico and the German papers Bild and Die Welt), the Financial Times, France’s Le Monde and Spain’s PRISA (owner of El País and La SER), News Corp (including the Wall Street Journal, Barron’s, MarketWatch, The Times (London), and The Australian), Condé Nast (The New Yorker, Vogue, Wired, Vanity Fair), The Atlantic, and Hearst (including newspapers such as the Houston Chronicle and the San Francisco Chronicle and magazines such as Esquire and Cosmopolitan). Furthermore, like its competitor Google, OpenAI struck a deal with Reddit in May 2024, giving it access to Reddit’s vast corpus of user discussions. These deals collectively give the content owners hundreds of millions of dollars and give the AI companies access to troves of valuable training data.
What exactly is being licensed and why?
The recent wave of agreements is striking, but it is easy to misconstrue their substance and implications. Let’s begin with substance. It is not clear whether any of these licensing arrangements relate exclusively to clearing potential copyright hurdles in the training of AI models. For example, Reddit’s arrangements with Google and OpenAI seem to be more about access than copyright, because Reddit does not own the copyright in the user-generated content that makes its platform so valuable. Reddit users grant broad, non-exclusive rights to the platform, including the right to sublicense their content for AI training, but Reddit itself could never sue the AI companies for copyright infringement because, as a non-exclusive licensee, it lacks standing under the Copyright Act. Likewise, OpenAI’s deals with global news publishers Axel Springer and News Corp amount to a fee for access to paywalled content and a license for the delivery of summaries of that content in response to ChatGPT queries.
There are many reasons why access is worth paying for, even setting aside copyright risks. A developer might choose to pay for API access rather than rely on web scraping, because the former is more reliable: changes to a website’s architecture can easily break web-scraping programs. Gaining approved access to content also addresses the potential risk of a breach-of-contract action, to the extent that a website’s terms of service constitute an enforceable contract.
Nevertheless, copyright remains an important motivator. As I foreshadowed in “Fairness and Fair Use in Generative AI,” if the content is paywalled or otherwise access-restricted, gaining approved access may be required under some interpretations of the fair use doctrine. Even if the fair use case for training a general-purpose model on works scraped from the internet is sound, certain agentic uses—uses involving autonomous AI decision-making—may raise different copyright issues. As LLMs gain the capacity to interact with the world via tools and APIs, it becomes more common for their responses to feature retrieval-augmented generation (RAG). In other words, rather than simply responding with the most probable tokens that would follow a question, the systems offered by OpenAI, Google, Anthropic, and others will route the query to a web search if that seems appropriate and then feed the retrieved content back to the model for evaluation. If you ask Claude a question about the recent Google Search antitrust case, such as “Why did the court just rule that Google will not have to sell its Chrome browser after all?”, it retrieves content from about a dozen different news sources and provides an integrated summary of the relevant information. The increasingly agentic nature of LLMs means a model’s outputs can include up-to-date information, like current news, but it also introduces new copyrighted material into the process at inference time. Experience suggests that RAG exposes the end user to a non-trivial amount of direct quotation or light paraphrasing that might cross the line into copyright infringement. This makes a case for obtaining licenses for such uses.
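To make the mechanics concrete, the sketch below shows the basic retrieve-then-generate loop in a few lines of Python. It is an illustrative outline only: the `search_web` and `generate` functions are hypothetical stand-ins, not the actual APIs of OpenAI, Google, or Anthropic, and real systems layer query routing, ranking, and citation logic on top of this skeleton.

```python
# Minimal sketch of a retrieval-augmented generation (RAG) loop.
# search_web() and generate() are hypothetical stand-ins for a real search API
# and a real model API; no vendor's actual implementation is shown here.

def search_web(query: str, num_results: int = 12) -> list[dict]:
    """Return retrieved documents as [{'source': ..., 'text': ...}, ...]."""
    raise NotImplementedError("stand-in for a real search/retrieval API")

def generate(prompt: str) -> str:
    """Return the model's completion for the given prompt."""
    raise NotImplementedError("stand-in for a real model API")

def answer_with_rag(question: str) -> str:
    # 1. Retrieve documents at inference time. This is where third-party
    #    copyrighted text re-enters the pipeline, regardless of what was
    #    (or was not) in the training data.
    docs = search_web(question)
    context = "\n\n".join(f"[{doc['source']}]\n{doc['text']}" for doc in docs)

    # 2. Feed the retrieved content back to the model and ask for a
    #    grounded, cited summary rather than a purely generative answer.
    prompt = (
        "Using only the sources below, answer the question and cite your sources.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    return generate(prompt)
```

The copyright-relevant point is in step 1: because retrieval happens at inference time, the licensing question for RAG is distinct from the licensing question for training, which is one reason several of the publisher deals expressly cover the delivery of summaries in responses.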
Will licensing scale?
Putting those uncertainties and qualifications to one side, there is no doubt that some of the licensing deals do relate to copyright permissions for AI training. But does it follow that, because some market participants are willing and able to come to terms, licenses can replace fair use in providing training data at large scale? This question has a normative and a practical dimension, but at least in the context of LLM development, the practical considerations are decisive. Licensing does not scale.
Why does scale matter? In the space of just a few years, LLMs have gone from little more than a research toy to a seemingly world-changing technology with profound capabilities. This progress has largely resulted from scaling up both the models themselves and the datasets on which they are trained. Today’s leading LLMs are built at what proponents describe as “internet scale,” meaning datasets comprising trillions of tokens harvested from the publicly accessible web. Not every generative AI model is trained on such a massive corpus, but all of the current cutting-edge LLMs are. The major players in AI research are focused on LLMs both because these are general-purpose technologies and because they see them as stepping stones toward artificial general intelligence. Examples of weak small language models and high-performing narrow generative AIs do not tell us anything about whether licensing can meet the training needs of cutting-edge LLM development.
Not enough dance partners
Copyright licensing will not achieve internet scale because there are relatively few concentrated media interests with clear rights to sets of potential training data large enough to make them attractive partners. Striking deals with concentrated media interests is possible because it overcomes the valuation and transaction-cost problems that would render licensing individual works impossible. Although training data is valuable in aggregate, the marginal contribution of any individual piece of training data, whether a book or a social media post, is approximately zero. When OpenAI agreed to a licensing deal with Condé Nast allowing the use of content from Vogue, The New Yorker, Vanity Fair, Wired, and more, it could place a non-zero value on the collection. Negotiating with one magazine conglomerate is more feasible than negotiating with tens of thousands of authors. Collective rights organizations such as the Copyright Clearance Center (CCC) claim to be offering AI licensing deals, but it is by no means clear that they have a mandate from the relevant rights holders to do so. As noted above, platforms like Reddit that host user-generated content also often give themselves the right to license that content for AI training. However, there is not an unlimited supply of such gatekeepers who are not already subsidiaries of one of the major hyperscalers.
A handful of cherries does not make a sundae
What is more, the few concentrated rights holders who can overcome the transaction costs and make licensing worthwhile just don’t have enough data. As Peter Yu and I explained elsewhere, it would take the New York Times about 316,000 years to generate the 15 trillion tokens used to train Meta’s Llama 3 model. And even after all that time, these would not be the right tokens, because they would lack the diversity that is essential for training large generalist models.
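A back-of-envelope calculation makes the gap vivid. The annual-output figure below is an assumption chosen to be consistent with the 316,000-year estimate, not an audited count of the New York Times’s production:

```python
# Back-of-envelope illustration of the scale gap between one publisher's output
# and a web-scale training corpus. The per-year figure is an assumption implied
# by the ~316,000-year estimate, not an official statistic.
llama3_training_tokens = 15e12        # ~15 trillion tokens reported for Llama 3
assumed_nyt_tokens_per_year = 47.5e6  # assumed new New York Times text per year

years_needed = llama3_training_tokens / assumed_nyt_tokens_per_year
print(f"{years_needed:,.0f} years")   # roughly 316,000 years
```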
This is not to suggest that access to the New York Times would not be worth paying for. Perhaps its content is so unique and of such high quality that it is the cherry on top of the training data sundae, but a handful of cherries does not make a sundae. If access to web-scale data is truly necessary for the development of cutting-edge LLMs, then the existence of a handful of licensing deals with concentrated rights holders does almost nothing to address the broader structural problems.
Large language model rights association?
Another reason that copyright licensing will not scale to meet the needs of LLM development is that the vast majority of works are not actively managed, let alone made available through any formal licensing system, nor are they likely to be. If using copyrighted material for AI training always required a license, most of the world’s creative output would simply be off-limits for training.
The American Society of Composers, Authors and Publishers (ASCAP) and similar organizations offer blanket performance licenses for musical works on a massive scale, creating a market for performing rights that would not otherwise exist. Many imagine that collective licensing could likewise bridge the gap between AI companies and the millions (or perhaps tens or hundreds of millions) of copyright owners whose works have been used to train LLMs and other generative AI models. In the world of music, ASCAP can distribute money efficiently because it makes no payments to rights holders who have not earned enough to justify the cost of writing a check ($100) or making a direct deposit ($1). That works because there is data on what gets played, and the copyright owners whose works are not performed often enough get nothing. But current technology does not allow us to trace which individual works an AI system draws on, or to what extent: there is no reliable way to tell which books, articles, or images were most influential in shaping a given output. That makes it nearly impossible to tie compensation to actual usage or any other measure of importance.
If there were an ASCAP for LLMs—let’s call it the Large Language Model Rights Association, or LLMRA—it would have no better option than to divide its pot equally among everyone with a potential claim: not just songwriters, book authors, and photographers, but everyone who ever wrote a social media post or other digital ephemera that made its way into the training data. It is hard to see how such an LLMRA could in fact send any payments to anyone except the large rights holders who have already negotiated access deals. Whatever its supposed structure, LLMRA would not be in a position to make payments to individual authors where the transaction costs involved in making individual distributions exceed the amount available for distribution. LLMRA could certainly collect rents, but it would operate more like a tax, or the old system of buying indulgences, than like any system of collective licensing that exists in the U.S. today.
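A stylized example with invented numbers shows why. Suppose LLMRA collected a remarkably large pot and divided it equally among everyone with a plausible claim; the figures below are hypothetical, chosen only to illustrate the order of magnitude:

```python
# Stylized illustration only: both figures are invented for the example.
annual_pot = 1_000_000_000           # hypothetical $1 billion collected by "LLMRA"
potential_claimants = 2_000_000_000  # hypothetical count of people with posts,
                                     # photos, or other ephemera in a web-scale corpus

per_claimant = annual_pot / potential_claimants
print(f"${per_claimant:.2f} per claimant per year")  # $0.50

# Compare with the ASCAP-style distribution thresholds noted above: $1 for a
# direct deposit and $100 for a paper check. At fifty cents a head, the cost of
# identifying, verifying, and paying claimants would swallow the entire pot.
```

Raising the pot or shrinking the pool changes the arithmetic but not the structure of the problem: whatever reaches individual claimants is dwarfed by the cost of finding and paying them.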
Conclusion
The impossibility of clearing rights at scale does not settle the normative and doctrinal arguments for and against treating AI training as fair use, but I have said enough elsewhere about why nonexpressive use should be fair use. This article addresses just one small piece of the debate: whether a private ordering solution is a feasible alternative to fair use. Despite the most ardent wishful thinking of copyright maximalists and private ordering enthusiasts, it seems quite unlikely that cutting-edge LLMs could be trained on a mixture of public domain and licensed materials without access to the volumes of training data developers have scraped from the open internet.
Author Disclosure: The author reports no conflicts of interest. You can read our disclosure policy here.
Articles represent the opinions of their writers, not necessarily those of the University of Chicago, the Booth School of Business, or its faculty.