Christian Peukert argues that the market for licensing content from copyright owners such as newspapers and online forums requires a standardized regime if the data used to train artificial intelligence models is to remain available to more than just the largest AI firms. A failure to maintain non-discriminatory access will lead to consolidation in both the AI and content production markets.
Editor’s Note: This article is part of a symposium that examines the competitive risks posed by artificial intelligence content licensing agreements—mechanisms that on the surface appear to balance copyright protection with innovation. How might these deals compel us to rethink the relationship between copyright, data protection, innovation, and competition in the age of AI? How can we seek to uphold all four, while recognizing the inevitable tradeoffs involved? You can read the articles from Mark Lemley and Jacob Noti-Victor (writing together), Matthew Sag, Christian Peukert, and Kristelia García here as they are published.
Generative artificial intelligence systems like ChatGPT, Gemini, and Claude depend on vast amounts of human-created data: text, images, and sound that teach machines to produce outputs that generate tremendous value for end users. Much of this material is copyrighted. As the legal and reputational risks of unlicensed scraping have grown, leading AI developers have moved to formalize access through licensing agreements with content owners. Over the past two years, OpenAI has signed deals with Reddit, The Guardian, and News Corp. Google, Anthropic, and Amazon have followed suit. Most recently, the record labels Universal and Sony announced licensing deals with Udio and Spotify to enable business models around AI-generated music. These arrangements, which can involve multimillion-dollar payments or privileged access to AI tools, are framed as balancing innovation with fair compensation for creators.
Yet beneath their pragmatic surface lies a deeper structural shift: the move from an open web to a restricted data economy. By transforming human expression into a tradable input, data licensing deals redefine not only the economics of creativity but also the competitive landscape of AI itself. Content licensing has become a form of infrastructure for the AI economy. Whoever controls access to large, high-quality data archives—be it publishers, platforms, or data brokers—now occupies a pivotal position in shaping innovation. Due to financial constraints and the exponential returns on access to better data, only a few large AI firms are likely to be able to purchase access to such high-quality data. To maintain a competitive AI market, we need a statutory or collective licensing regime that standardizes access to content and ensures that all qualified developers can train models under transparent, non-discriminatory terms.
The consolidating effects of licensing
In the early years of AI, firms treated data as abundant and effectively free. In effect, it was, as content platforms seemed little concerned about having their data scraped through initiatives like Common Crawl. Instead, the challenge was technical: how to scrape, clean, and process content at web scale.
The early days are now over. Today, with immense realized economic value at stake, accessing data has become much more of a legal and economic challenge. Content owners have asserted their rights over their archives, sometimes through multimillion-dollar lawsuits, and access now depends on contracts and payments. For established AI players, this formalization provides legal certainty and reputational safety. For newcomers, it can create barriers to entry. Licensing thus converts what was once an open commons into a network of private pipelines. Each deal determines who can train on which data and under what conditions. As these contracts multiply, they form a patchwork of access that favors firms with deep pockets and legal infrastructure. The very act of making AI “responsible,” in the sense of complying with intellectual property and data protection laws, risks also making it exclusive.
Licensing introduces fixed costs that only a few players can bear. Negotiating, auditing, and integrating large content agreements requires specialized teams and substantial capital. These are prohibitive costs for smaller AI firms. Moreover, these deals can be exclusive or de facto exclusive. A single multi-year contract can tie up an essential dataset, such as decades of news reporting, scientific literature, or user discussions, and prevent rivals from accessing it on comparable terms. Even when exclusivity is not explicit, preferential terms, volume discounts, or early access can create similar effects.
Such arrangements shift the locus of competition. A prerequisite to compete in the market for AI models is to compete in the market for training data. Once one firm becomes dominant, because it has access to richer data or better integration, the market can tip further in its favor. Economists have argued that scale advantages on both the demand and supply sides, as well as user feedback loops, can reinforce the leader’s position. The next generation of challengers faces an even steeper climb, as they lack both users and the unique data that underpin leading models. When data, compute, and distribution are all concentrated within the same few firms, the risk is not temporary dominance but structural entrenchment.
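To make the tipping logic concrete, consider a toy simulation in which a firm’s model quality tracks its share of user data and users gravitate toward the higher-quality model. All names and parameters are illustrative assumptions; this is a sketch of the feedback mechanism economists describe, not an empirical model of the AI market.

```python
# Toy simulation of a data feedback loop (illustrative assumptions only):
# each firm's model quality is proportional to its share of user data,
# and users reallocate toward the higher-quality model each round.

def simulate_tipping(share_a: float = 0.55, steps: int = 20,
                     sensitivity: float = 4.0) -> list[float]:
    """Track firm A's user share when data begets quality begets users."""
    history = [share_a]
    for _ in range(steps):
        # Quality equals accumulated data share (a stylized assumption);
        # users respond to quality differences with some sensitivity.
        quality_a, quality_b = share_a, 1.0 - share_a
        weight_a = quality_a ** sensitivity
        weight_b = quality_b ** sensitivity
        share_a = weight_a / (weight_a + weight_b)
        history.append(share_a)
    return history

if __name__ == "__main__":
    # A modest 55/45 head start tips toward near-monopoly in a few rounds.
    print([round(s, 3) for s in simulate_tipping()])
```

Under these assumed dynamics, even a small initial data advantage compounds quickly; the leader’s share approaches one within a handful of rounds, which is the structural entrenchment the paragraph above describes.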
Licensing also reshapes power on the supply side. Only the largest content owners—major media conglomerates, scientific publishers, or social platforms—can negotiate directly with AI companies. Like large AI firms, they have both the bargaining leverage and the infrastructure to manage complex contracts. Smaller creators, such as independent journalists, artists, and community forums, do not. Their content may be equally valuable for training but harder to track, license, or enforce. As a result, they are often left out of negotiations entirely. Somewhere in the middle are collective management organizations, which have begun experimenting with licensing their members’ works for AI use. Examples include the Copyright Clearance Center’s AI reuse license in the United States, the Copyright Agency’s AI-covered business license in Australia, and VG WORT’s AI license in Germany.
The likely outcome is a dual consolidation: fewer major publishers controlling content supply, and fewer major AI firms controlling demand. In the short term, vertical licensing arrangements may appear to improve efficiency: AI firms get high-quality data, creators get paid, and consumers gain access to more reliable systems. But the longer-term effects on dynamic competition, the capacity for continual entry and innovation, are more concerning.
Consequences of concentrated licensing
Unlike other inputs, data can depreciate quickly. The world changes, new knowledge emerges, and language evolves. AI models trained on static archives inevitably lose relevance. An example is automated news recommendation: in a large-scale field experiment, colleagues and I showed that the effectiveness of personalized news recommendation, measured by click-through rates, drops to zero after just one and a half days without fresh data on a user’s click history. While not every domain is as fast-paced as daily news, the point is general: as the world evolves, yesterday’s data loses informational value, making continual updating necessary. Sustaining the value of AI requires a continuous flow of new content and the freedom for new developers to access it.
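A stylized sketch can make the depreciation intuition concrete. Suppose a data point’s informational value decays exponentially; the half-life and cutoff below are hypothetical, chosen only to mirror the qualitative finding that effectiveness vanishes within roughly a day and a half, and are not estimates from the experiment.

```python
# Stylized illustration of data depreciation (hypothetical parameters):
# the value of a piece of training or personalization data decays as
# the world moves on, hitting an effective zero within ~1.5 days.

HALF_LIFE_HOURS = 8.0   # hypothetical: value halves every 8 hours
CUTOFF = 0.05           # treat value below 5% of initial as "zero"

def data_value(age_hours: float, initial_value: float = 1.0) -> float:
    """Exponential decay of a data point's informational value."""
    decayed = initial_value * 0.5 ** (age_hours / HALF_LIFE_HOURS)
    return decayed if decayed >= CUTOFF * initial_value else 0.0

if __name__ == "__main__":
    for hours in (0, 12, 24, 36):
        print(f"after {hours:>2}h: value = {data_value(hours):.3f}")
    # With an 8-hour half-life, value falls below the cutoff shortly
    # after 34 hours, i.e. roughly one and a half days.
```

Whatever the true decay rate in a given domain, the qualitative lesson is the same: a one-off archive license buys a depreciating asset, so sustained model quality depends on continuous access to fresh content.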
Licensing that concentrates access in a few firms undermines both. Content creators react to perceived unfairness or exclusion. If they see their work enriching AI systems without recognition, they may disengage, restrict access, or reduce the quality and frequency of their contributions. This behavioral feedback is more than anecdotal: it can reshape the long-term availability of human-generated data. As creators retreat behind paywalls or closed platforms, the open web that sustained early AI development shrinks. Models trained by smaller AI firms on narrower, older, or more repetitive data perform worse, reinforcing the advantage of incumbents who can afford privileged access to curated archives. In this sense, failing to reward contributors can create a self-reinforcing scarcity: fewer open contributions mean lower data quality, which in turn lowers the quality of models that lack access to licensed content. Some AI firms will be forced to leave the market. The result is less competition in both the market for AI models (raising prices for end users) and the market for training data (lowering prices for content licenses).
Restricted access to content has consequences beyond market consolidation. When models are trained predominantly on professional, corporate, or institutional content, they reflect the biases and priorities of those sources. This may improve reliability in certain contexts but reduces representational diversity. Languages with less commercial content, or communities not served by major media, risk being underrepresented in training data. As AI systems become increasingly integral to communication, research, and cultural production, such biases translate into informational inequality.
An alternative licensing regime
How, then, can competition authorities ensure that entrants and smaller players continue to access quality data to create competitive products? The intersection of copyright and competition law is often treated as a source of tension: the former protects exclusivity to reward innovation, while the latter promotes openness to spur market churn and its benefits to consumers and other stakeholders. But in the case of AI training data, copyright policy could perform some of the functions of competition policy. A statutory or collective licensing regime could mitigate the structural frictions of the current market by both remunerating all content owners and keeping access open to all AI firms. Instead of bespoke deals between a few powerful AI firms and a few major publishers, such a framework would standardize access to content and ensure that all qualified developers can train models under transparent, non-discriminatory terms.
In practice, such a system would likely take the form of a statutory license administered by a public authority or collecting body that represents online content creators and publishers. Rather than negotiating thousands of separate agreements, AI developers would obtain a blanket license allowing them to train on copyrighted web material in exchange for a regulated fee. A rate-setting body—analogous to copyright tribunals or the U.S. Copyright Royalty Board—would periodically review and adjust fees based on evidence about the value of different types of data and their role in improving model performance. Payments would flow to rightsholders through existing collective management structures, ensuring that both large and small creators receive compensation. Larger AI firms would bear a greater share of total payments, reflecting their higher demand for data and the greater gains they derive from improvements in model quality. Such a system would not eliminate the need for competition oversight, but it could provide a balanced framework that preserves incentives to create while sustaining open access for innovation.
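For concreteness, here is a minimal sketch of how such a blanket license might operate, assuming, purely for illustration, a per-gigabyte regulated fee, payments proportional to the volume of data a developer trains on, and pro-rata distribution to rightsholders by usage share. All rates, names, and shares below are invented; an actual rate-setting body would determine and periodically revise them on evidence.

```python
# Minimal sketch of a blanket statutory license (illustrative values):
# developers pay in proportion to training volume; proceeds are split
# among rightsholders pro rata by usage share.

RATE_PER_GB = 2.0  # hypothetical regulated fee, in dollars per gigabyte

def collect_fees(training_volumes_gb: dict[str, float]) -> dict[str, float]:
    """Fee owed by each developer: heavier users of data pay more."""
    return {firm: gb * RATE_PER_GB for firm, gb in training_volumes_gb.items()}

def distribute(pool: float, usage_shares: dict[str, float]) -> dict[str, float]:
    """Split the collected pool among rightsholders by usage share."""
    total = sum(usage_shares.values())
    return {rh: pool * share / total for rh, share in usage_shares.items()}

if __name__ == "__main__":
    fees = collect_fees({"BigLab": 5000.0, "SmallLab": 50.0})
    pool = sum(fees.values())
    payouts = distribute(pool, {"MajorPublisher": 0.6, "IndieForum": 0.4})
    print(fees)     # the larger firm bears the larger share of payments
    print(payouts)  # small creators are paid through the same mechanism
```

The design choice that matters is not the specific rate but the structure: one transparent price schedule replaces thousands of bilateral negotiations, so a small lab faces the same terms per unit of data as an incumbent.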
Critics might object that statutory licensing reduces incentives for private negotiation, but the alternative, a patchwork of exclusive deals, poses far greater risks to competition and diversity. In effect, a well-designed statutory license could preemptively address what competition authorities might struggle to remedy ex post, after the harms to competition have already occurred. The AI industry already exhibits characteristics of competition for the market rather than in the market: massive upfront investments, scale economies, and feedback loops that favor the winner. Once the market tips, potentially accelerated by exclusive data access, it will be difficult to restore competition.
Conclusion
Licensing agreements are reshaping the foundations of AI. They promise to solve some legal and economic problems of the data economy but risk creating new forms of concentration, both among AI firms that control access to data and among publishers that control its supply. Left unchecked, the result may be a market that is safe but static: fairer towards data creators on paper, yet less open and less dynamic.
Copyright policy could help prevent this outcome by acting, in effect, as a form of ex ante competition policy that predetermines what a competitive market should look like. A statutory licensing regime would simplify transactions, reduce bargaining asymmetries, and keep the playing field level. By setting transparent, non-discriminatory terms for access to creative works, such a framework would sustain quality and differentiation on both sides of the market. It would not replace antitrust enforcement but complement it, targeting the bottlenecks where market power now forms. The challenge is to design copyright rules that preserve incentives for creation while ensuring that access to human knowledge remains broad enough to sustain rivalry and renewal. If data is the infrastructure of intelligence, it must be governed as such: a renewable resource, maintained collectively rather than enclosed privately.
Author Disclosure: I have received research funding from the National Bureau of Economic Research (United States of America), the New York University NET Institute, the Portuguese Foundation for Science and Technology, the Swiss National Science Foundation, and the World Intellectual Property Organization. Either through funding or data sharing, my research was supported by organizations in the logistics sector (DHL Supply Chain), the news and media sector (organizations in Austria, Germany, and Switzerland), the software sector (GitHub, Google), and the telecom sector (a service provider in Southern Europe). I have provided expert testimony in oral and written form for the European Parliament. I am an academic member of the United Nations’ working group on Data Governance. My interests as a rights holder are represented by VG WORT and GEMA, where I am a member. You can read ProMarket’s disclosure policy here.
Articles represent the opinions of their writers, not necessarily those of the University of Chicago, the Booth School of Business, or its faculty.
Subscribe here for ProMarket’s weekly newsletter, Special Interest, to stay up to date on ProMarket’s coverage of the political economy and other content from the Stigler Center.