Although many tech companies maintain that training on copyrighted material is fair use, they are rushing to acquire all types of data to train AI models. High-quality content is in particular demand, including material that cannot be scraped because it is not already available on the internet and that could instead come from licensing publisher archives.
In the following table, VIP+ breaks down all the confirmed content licensing deals between tech companies and publishers for data used to train AI models. It provides all the publicly announced or reported details on the specific data publishers have licensed, deal values and other types of value exchanged in agreements.
In addition to payment, deal terms commonly also include other forms of value exchange, such as giving publishers privileged access to tools or developer teams to help publishers create new AI-powered products. This suggests publishers that engage in licensing tend to be interested in actually using the tools they contribute their content to build.
AI Companies
OpenAI has been the most prolific licensee, securing content and product partnerships with several major publishers since its launch of ChatGPT and DALL-E in the fall of 2022 kicked off the emerging market for AI training data.
Other confirmed data licensees include Meta, Google, Runway and, more recently, Reka AI and Picsart. Beyond its disclosed deals with OpenAI and Meta, Shutterstock has also reportedly struck nonpublic deals with Apple, Amazon and Google, per Reuters.
Among Big Tech firms, deal talks have also been reported between Apple and various publishers, and between Google-owned YouTube and Sony Music Entertainment (SME), Warner Music Group (WMG) and Universal Music Group (UMG) to license songs for AI training. Licensing talks have also reportedly occurred with other large-scale holders of video clips, such as Photobucket, per Reuters in April.
Publishers
Publishers that have engaged in licensing are predominantly news publishers, such as The Associated Press, Axel Springer, News Corp, Vox Media, Dotdash Meredith and The Atlantic. Stock content companies Getty Images and Shutterstock have also been recurring licensors, striking deals with multiple tech companies.
Community-based platforms round out the confirmed licensors, including social platform Reddit and developer forum Stack Overflow, where engaged users and developers share knowledge on specific topics.
Licensed Data
Text has been by far the most common data type to be licensed. Confirmed dealmaking for other types of data, whether images, video or music, has primarily come from stock content companies Shutterstock and Getty Images.
Known licensing activity has been particularly scant for music and video content. Publicly acknowledged deals have been nonexistent for premium video content, such as films or TV shows, though Bloomberg reported in May that Alphabet, Meta and OpenAI were engaging in talks with Hollywood studios to license shows for their video generation models. Warner Bros. Discovery had expressed interest in licensing some of its shows, while Walt Disney Co. and Netflix showed interest in other types of collaborations but weren’t willing to license their content.
Confidence is growing in training AI models on high-quality synthetic (AI-generated) data for some modalities, particularly text. The threat of “model collapse,” in which models degrade when recursively trained on synthetic data, has long been debated in the AI research community. Yet a recent paper published in the journal Nature offered new analysis warning that the phenomenon does occur.
Additionally, synthetic data may not work equally well for video model training because outputs of many of today’s video models still contain artifacts (errors or inconsistencies) that don’t perfectly simulate 3D reality and that would degrade a model’s general ability to understand how the real world operates. That gap suggests a particular opportunity for licensing high-quality video content.
Licensing video data for AI model training could be a substantial financial opportunity for film and TV producers, particularly as AI companies race to improve video generation models capable of producing sophisticated outputs in an increasingly competitive field.
Recent weeks have seen strong rival entrants emerge in the wake of Sora and Runway’s Gen-3, including Kling and Luma AI’s Dream Machine. Many more video models are in development, as many as 65, a source told VIP+.