Brickroad
News and Media Data Deals
Associated Press and OpenAI (July 2023): The Associated Press (AP) agreed to license its entire news story archive (dating back to 1985) to OpenAI. Financial terms were not disclosed, but AP received access to OpenAI’s technology and product expertise as part of the deal. The license is non-exclusive and focused on using AP text for training OpenAI’s models (not for direct content display).
Industry: Digital/print media. Data: News articles (text).
Axel Springer and OpenAI (Dec 2023): Axel Springer (owner of Politico, Bild, Business Insider, etc.) announced a multi-year content licensing deal with OpenAI for its publications’ archives. In return, OpenAI reportedly gives Axel Springer content a “favorable position” in ChatGPT’s search results. Terms (likely non-exclusive) and payment were not publicly disclosed.
Industry: Digital/print media. Data: News articles.
Le Monde and OpenAI (Mar 2024): France’s Le Monde signed a multi-year agreement to license its news content to OpenAI. OpenAI can train on Le Monde’s full text corpus, and ChatGPT responses referencing Le Monde articles will include the publication’s logo, titles, and hyperlinks to the source. Financial terms weren’t disclosed.
Industry: Digital/print media. Data: News articles.
Prisa (El País, etc.) and OpenAI (Mar 2024): OpenAI partnered with Spain’s Prisa Media, owner of El País, AS, Cinco Días, and others. This enables ChatGPT to train on and surface Prisa’s Spanish-language news content with attribution. Terms were not public, but OpenAI stated such publisher deals help provide high-quality, real-time information in multiple languages.
Industry: Digital/print media. Data: News articles.
Financial Times and OpenAI (Apr 29, 2024): OpenAI signed a licensing deal with the Financial Times to access its archives of news and feature articles. The FT’s content will be used as training data for ChatGPT, and any FT article used as a source will be linked in ChatGPT’s answers. The deal is multi-year; financial details weren’t disclosed.
Industry: Digital/print media. Data: News articles.
News Corp (Wall Street Journal, etc.) and OpenAI (May 2024): OpenAI reached a five-year content deal with News Corp (owner of The Wall Street Journal, New York Post, and others). This exclusive deal allows OpenAI to use News Corp’s journalism in its models, and it was reported to be valued at over $250 million over the 5-year term.
Industry: Digital/print media. Data: News articles.
The Atlantic and OpenAI (May 2024): The Atlantic entered a strategic multi-year content and product partnership with OpenAI. OpenAI gains access to The Atlantic’s archives to train its models and feature Atlantic content in ChatGPT, with attribution and links back to the site. In exchange, The Atlantic’s team gets early access to OpenAI’s technology to develop AI-driven products. Terms were not disclosed.
Industry: Digital/print media. Data: Magazine articles.
Thomson Reuters and Undisclosed (2023): Reuters’ parent company (Thomson Reuters) struck deals to license its news and possibly financial data for generative AI training. In May 2024, Thomson Reuters reported a 21% year-over-year revenue increase primarily due to “generative AI–related content licensing” (May 2024 6-K). This likely refers to agreements supplying news feeds (and perhaps legal or financial datasets via its Westlaw and Refinitiv units) to AI developers. While specific clients weren’t named in the filing, OpenAI and Microsoft are logical partners – Reuters content now appears in Microsoft’s Bing Chat and other AI systems. The deals are presumably multi-year and non-exclusive.
Industry: News & financial data. Data: Real-time newswire text (and potentially financial market data or legal text).
Vox Media and OpenAI (May 2024): Vox Media (owner of The Verge, New York Magazine, Vox.com, etc.) signed a content-and-product partnership with OpenAI. OpenAI can train on Vox Media’s archives and integrate that content into ChatGPT, while Vox will leverage OpenAI’s tech to build new AI-powered products for its audiences. No public financial details.
Industry: Digital/print media. Data: News and online magazine articles.
Condé Nast and OpenAI (Aug 2024): OpenAI announced a multi-year deal with Condé Nast to license content from its flagship brands (The New Yorker, Vogue, GQ, Wired, etc.). OpenAI’s products (ChatGPT and the prototype “SearchGPT”) will be able to display and use Condé Nast articles for training and answering queries, with appropriate attribution. Financial terms were not disclosed.
Industry: Digital/print media. Data: Magazine and web articles.
Time Magazine and OpenAI (2023–2024): Time was also among the publishers OpenAI partnered with (deal not publicly announced until revealed in 2024). The arrangement grants OpenAI access to Time’s content archive for model training.
Industry: Digital/print media. Data: Magazine articles.
Axios and OpenAI (Jan 2025): Axios entered a three-year licensing and technology partnership with OpenAI. OpenAI is providing funding that Axios will use to launch four new local news bureaus, and all Axios journalists get access to OpenAI’s enterprise tools. In return, Axios’s journalism will be used to train OpenAI’s models and will appear in ChatGPT with attribution and links. Financial specifics were not disclosed, but the commitment suggests a multi-million dollar investment in Axios.
Industry: Digital/print media. Data: News articles.
The Guardian and OpenAI (2024): The Guardian (UK) has reportedly signed a deal with OpenAI (exact date undisclosed) to license its news content for AI training. The Guardian’s deal is one of several between publishers and OpenAI who are opting for partnership instead of pursuing legal action. No public details on term or payment.
Industry: Digital/print media. Data: News articles.
Schibsted and OpenAI (2024): Schibsted, a major Norwegian media group (publisher of Aftenposten, VG, etc.), also inked a content licensing agreement with OpenAI. Financial details were not publicly announced, but the deal presumably gives OpenAI multi-year access to Schibsted’s news archives for model training.
Industry: Digital/print media. Data: News articles (multiple languages).
Future plc and OpenAI (2024): Future plc (UK-based publisher of tech, gaming, and specialty magazines/websites) signed a deal with OpenAI to license its content. No specifics were released, but it likely involves OpenAI using Future’s articles (from sites like TechRadar, PC Gamer, etc.) as part of training data.
Industry: Digital/print media. Data: Online media articles.
Hearst Magazines and OpenAI (2024): Hearst’s magazine division (Cosmopolitan, Esquire, Popular Mechanics, etc.) reportedly partnered with OpenAI to license content. Terms were undisclosed, but Hearst’s president indicated that the partnership will “enhance both companies’ offerings,” suggesting the arrangement may involve technology exchange in addition to licensing.
Industry: Digital/print media. Data: Magazine articles.
Dotdash Meredith and OpenAI (late 2024): Digital publisher Dotdash Meredith (owner of People, Better Homes & Gardens, etc.) is also listed among OpenAI’s content partners. This likely allows OpenAI to train on Dotdash Meredith’s lifestyle and news content archives. Not publicly announced in detail; presumed multi-year with undisclosed payment.
Industry: Digital/print media. Data: Digital magazine articles.
Google and Associated Press (Jan 2025): In Google’s first deal with a news publisher specifically for AI, it signed an agreement with AP to provide a real-time news feed for Google’s generative AI (the Gemini chatbot). AP will supply up-to-date newswire content to help train and improve Gemini’s answers, ensuring timely and accurate information. Google didn’t disclose payment details. The deal builds on a longstanding Google–AP relationship in news licensing.
Industry: Digital/print media. Data: Real-time newswire articles.
Mistral AI and Agence France-Presse (Jan 2025): Paris-based startup Mistral AI struck a multi-year content partnership with AFP (Agence France-Presse). AFP will provide all of its text news stories (about 2,300 per day, in multiple languages) to Mistral’s new chatbot “Le Chat” for training and for real-time Q&A answers. The goal is to improve the factual accuracy of Mistral’s French-language model with trusted news data. Financial terms and contract length were not disclosed.
Industry: Digital/print media. Data: Newswire articles (multilingual).
New York Times and Amazon (May 2025): The New York Times agreed to license its journalism to Amazon, marking the publisher’s first licensing deal explicitly centered on generative AI. The multi-year agreement lets Amazon use NYT content – including news articles, NYT Cooking recipes, and sports coverage from The Athletic – in its AI products and services (e.g. Alexa and other AI models). This will involve real-time display of story summaries and short excerpts in Amazon’s products and the use of NYT content to train Amazon’s proprietary foundation models. Financial terms were not disclosed, but NYT stated the deal aligns with its principle that original reporting should be paid for.
Industry: Digital/print media. Data: News articles, recipes, and sports journalism.
Reuters – Meta (October 2024): Thomson Reuters (Reuters News) confirmed it has licensed news content to train LLMs. In its May 2024 6-K, Reuters said generative-AI related content licensing drove a 15% revenue increase in early 2024. Separately, Meta revealed a partnership whereby its AI assistant can provide news answers with summaries and links from Reuters content. Details of the Meta deal (likely 2023) were not publicly disclosed, but it is a non-exclusive license for news data.
Industry: Digital and print media.
Perplexity AI – Multiple News Publishers (2023): Perplexity, an AI search chatbot, launched a revenue-sharing program with publishers in 2023. Participants include The Independent (UK), Los Angeles Times, Lee Enterprises (US local newspapers), Adweek, World History Encyclopedia, Texas Tribune, Fortune, Time, Der Spiegel and others. These publishers license their content to Perplexity, which uses it to answer user queries with citations. In return, the publishers will share in advertising revenue when their content is shown, get visibility into how their content is used, and receive other benefits. Most of these deals are non-exclusive and involve no upfront payment but a revenue split model.
Industry: Digital and print media.
Prorata.ai – Multiple (500+) News Publishers (2024): Prorata, an AI attribution platform, secured licensing deals to train on content from publishers like DMG Media (Daily Mail), The Guardian, Sky News, Prospect magazine, Fortune, Axel Springer, Financial Times, and The Atlantic. These agreements, mostly in 2024, allow Prorata’s AI to use the publishers’ archives with permission. Details on each deal’s value are not public; they appear to be partnership licenses similar to Perplexity’s (possibly with revenue-sharing or subscription fees), and rumors point to a 50/50 revenue split with content licensing partners based on usage.
Industry: News/media. Data: News articles and magazine content.
Editor's Note: OpenAI has disclosed that by late 2024 it had partnered with at least 20 news organizations globally for licensed training data (Reuters). Its offers to news publishers have been reported at $1–5 million per year for content access (The Verge), though marquee deals like News Corp’s far exceed that. Meanwhile, some major publishers (e.g. NY Times) have chosen litigation over licensing, underscoring the range of responses to AI firms’ approaches (AP News) (Press Gazette).
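To make the gap between typical offers and marquee deals concrete, the short Python sketch below annualizes the News Corp figure and compares it with the reported $1–5 million per-year range. It assumes the reported total is spread evenly across the five-year term, which the source does not confirm, and it treats "over $250 million" as a floor. The function and variable names are illustrative only.

```python
# Figures quoted in the text (USD). "Over $250 million" is treated as a floor,
# so the News Corp result below is a lower bound.
NEWS_CORP_TOTAL = 250_000_000
NEWS_CORP_YEARS = 5
TYPICAL_OFFER_RANGE = (1_000_000, 5_000_000)  # reported per-year offers


def annualized(total: int, years: int) -> float:
    """Spread a total contract value evenly across its term (an assumption)."""
    return total / years


news_corp_per_year = annualized(NEWS_CORP_TOTAL, NEWS_CORP_YEARS)
multiple_vs_high = news_corp_per_year / TYPICAL_OFFER_RANGE[1]
multiple_vs_low = news_corp_per_year / TYPICAL_OFFER_RANGE[0]

print(f"News Corp: at least ${news_corp_per_year / 1e6:.0f}M per year")
print(f"Multiple of typical offers: {multiple_vs_high:.0f}x to {multiple_vs_low:.0f}x")
```

Under these assumptions the News Corp deal works out to roughly an order of magnitude or more above the typical reported offer, which is consistent with the note's description of it as an outlier.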
Social Media and Community Data Deals
Reddit and Google (2023): Reddit, a platform hosting vast user discussions, signed data licensing deals with several companies in 2023 – most notably a lucrative agreement with Google – to provide API access to Reddit’s historical and ongoing content. Google’s deal alone was reported at about $60 million per year. Reddit’s IPO filing (S-1) later revealed about $66.4M in licensing revenue (2021–2022), underscoring how selling data to AI firms (like this Google deal) became a significant new stream. In its October 30, 2023 10-K filing, Reddit reported that the aggregate value of its long-term contracts (primarily data licensing deals exceeding one year) is $294.8 million. The company noted it is “in the early stages” of its data licensing efforts, suggesting significant growth potential ahead.
Industry: Social media. Data: User-generated text (posts, comments).
X (formerly Twitter) – various licensees (2023): X offers Firehose API access, which provides a real-time feed of all public tweets for AI training and analytics. In 2023, X under Elon Musk set a high price on this data: $42,000 per month (roughly $0.5M/year) for limited access, or about $2.5 million per year for the full firehose. Enterprise clients (including AI model developers and financial firms) can pay for this firehose to train language models on social media data. Known clients include Bloomberg, which licensed Twitter data in 2023 to build a finance sentiment model, and IBM, which also reportedly licensed Twitter’s dataset in 2023 to train business AI models as an alternative to scraping (AP). Academic researchers previously accessed a “decahose” 10% tweet stream under earlier Twitter programs, but those programs ended in 2023 when API terms were tightened.
Industry: Social media. Data: Social network posts (tweets).
Stack Overflow and Google (Mar 2024): Stack Overflow (SO), a popular Q&A platform for developers, entered a strategic partnership with Google Cloud to integrate Google’s Gemini AI with SO’s knowledge base. The deal lets Google train its Gemini large model on 15+ years of SO’s question-and-answer data, aiming to enhance coding solutions in Google’s AI products. In turn, Stack Overflow gains by embedding Google’s AI into its site and tools for developers. Terms were not disclosed (this appears to be a product partnership rather than a direct data sale), and the data use is almost certainly non-exclusive. A similar deal was reportedly inked with OpenAI around the same time, which generated substantial backlash among the SO community. Many volunteer contributors objected and even took steps to “poison” the dataset in protest, reflecting tension in community-sourced platforms. In June 2025, Stack Overflow announced that its entire network (150+ sites of Q&A) is now available via the Snowflake Marketplace.
Industry: Online developer community. Data: Crowdsourced programming Q&A content.
Yelp and Perplexity AI (Mar 2024): Yelp licensed its database of business reviews and ratings to Perplexity AI’s search chatbot. This allows Perplexity to provide answers about local businesses (restaurants, services, etc.) with Yelp-sourced information. The search engine now cites multiple Yelp reviews and displays maps in its responses. Importantly, Perplexity stated it will not use Yelp data to train its own base models, only to provide grounded answers. It’s unclear if this 2024 arrangement introduced new terms beyond Yelp’s existing API programs, but Yelp’s 2023 10-K highlights that its “Other” revenue grew from $21M in 2020 to $47M in 2023 – possibly due in part to generative AI licensing (a ~$26M jump in that segment).
Industry: Local reviews platform. Data: User reviews and business listing info.
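As a rough check on the Yelp figures above, the following Python sketch computes the absolute change in "Other" revenue and the compound annual growth rate it implies. The growth rate is a derived illustration, not a figure from Yelp's filing, and nothing in the calculation says how much of the increase (if any) came from AI licensing. The function name is hypothetical.

```python
def growth_summary(start: float, end: float, years: int) -> tuple[float, float]:
    """Return (absolute change, compound annual growth rate) between two values."""
    change = end - start
    cagr = (end / start) ** (1 / years) - 1
    return change, cagr


# "Other" revenue in millions of USD, as quoted in the text: 2020 -> 2023.
change_m, cagr = growth_summary(21.0, 47.0, 2023 - 2020)
print(f"Change: ${change_m:.0f}M over 3 years, implied CAGR: {cagr:.1%}")
```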
LinkedIn (Microsoft) and OpenAI (2023, internal): Microsoft-owned LinkedIn did not announce a public licensing deal with OpenAI, but as part of Microsoft’s broader partnership with OpenAI, LinkedIn’s vast trove of professional data (public profiles, job postings, etc.) has been used to train certain enterprise GPT models. Since Microsoft already had rights to LinkedIn data via acquisition, this was an internal data sharing rather than an external license. No press releases were issued; the arrangement is inferred from Microsoft’s AI use cases and product integrations.
Industry: Professional social network. Data: User profiles, employment data.
Discord (no formal deal, 2023 policy change): Social platforms like Discord have not publicly licensed data to AI firms, but OpenAI’s model training is known to have included some public Discord forum data (via web scraping). In response, in 2023 Discord updated its policies to allow server administrators and users to opt out of data scraping for AI purposes. (No specific licensing agreements were announced; this highlights how some community platforms are reacting to unlicensed AI data usage.)
Editor's Note: The landscape for social/community data licensing has shifted dramatically over the past several years. A few large platforms (Reddit, Twitter/X, Stack Overflow, Yelp) moved from open or free data models to a paid licensing model, signing deals with tech companies building AI models. Non-English and niche platforms have tended to keep data in-house to develop their own AI solutions (Meta, Naver, Kakao, Weibo, etc.), at least for now. We’ve also seen first-of-their-kind agreements to license specific types of community content – code (GitHub via Microsoft), images (Shutterstock via multiple deals), reviews (Yelp), and Q&A text (Stack Overflow). These deals are fueling improvements in AI chatbots, search engines, recommendation systems, and content generators by injecting high-quality, human-created data.
Equally important are the failures and pushbacks: when licensing talks or usages have not respected stakeholders, they’ve led to conflict. As we head into the latter half of 2025, the trend is toward more structured partnerships (with clear terms and user opt-outs) rather than indiscriminate scraping. The industry appears to be feeling its way to an equilibrium where community data is fairly remunerated and ethically used for AI, as exemplified by deals like Reddit-Google and OpenAI’s pacts with news publishers. Each new deal sets precedents on price, attribution, and consent that will shape all future negotiations between content platforms and AI developers.
Visual Media and Creative Content Deals
Shutterstock and OpenAI (2021 & July 2023): Shutterstock, a stock image and footage provider, first partnered with OpenAI in 2021 to provide a library of images for training DALL·E. In July 2023, they expanded this into a six-year deal giving OpenAI a broad license to Shutterstock’s entire catalog of images, videos, and music clips for model training. In exchange, Shutterstock integrates OpenAI’s generative AI (like DALL·E) into its platform and launched a Contributor Fund to compensate artists whose works are used to train the AI. Shutterstock’s CFO stated the initial deals with big tech firms ranged from $25–50 million each, and most were later expanded. This implies OpenAI (and other partners, below) paid in that range.
Industry: Stock media. Data: Stock images, videos, music.
Shutterstock and Google, Meta, Amazon, Apple (late 2022 – 2023): In the wake of ChatGPT’s debut, Shutterstock struck licensing agreements with Google, Meta, Amazon, and Apple to supply hundreds of millions of its stock images, videos, and audio clips for AI training. Each of these deals was reportedly in the $25–50 million range initially, with some later expanded (Apple’s deal, previously undisclosed, was revealed by Reuters). These are non-exclusive multi-year licenses for each tech giant’s internal model training.
Industry: Stock media. Data: Images, videos, audio from Shutterstock’s library.
Shutterstock and NVIDIA, LG, etc. (2023): Shutterstock also entered collaborations with NVIDIA, LG, and other firms to develop generative AI tools, likely involving licensing portions of Shutterstock’s 3D asset library and image data. Specific terms weren’t given, but these partnerships indicate Shutterstock leveraging its content across the AI industry.
Industry: Stock media / AI tech. Data: Images and possibly 3D models from Shutterstock.
Getty Images and Stability AI (lawsuit, 2023): Getty Images – another stock image leader – did not license its content to Stability AI’s Stable Diffusion; instead, Getty sued Stability in early 2023 for allegedly scraping 12 million Getty photos without permission. As of 2024, that litigation is ongoing (in parallel with similar suits by artists). However, Getty’s earnings reports later revealed it has quietly engaged in some data licensing deals of its own. In the first half of 2024, Getty reported $3.2 million (2.6% of revenue) in its “Other” segment, primarily from data licensing. Getty’s CEO Craig Peters noted a couple of small AI licensing deals in Q2 and Q3 2024 with a “longstanding partner,” structured in line with Getty’s and its creators’ long-term interests. (He also said Getty passed on many deals that didn’t meet its ethical and financial criteria.) Those partner(s) weren’t named, but one known collaboration is Getty Images and NVIDIA (2023): Getty partnered with NVIDIA to develop Generative AI by Getty Images, a tool using Getty’s fully licensed content to train an image generation model. This NVIDIA partnership likely involved a licensing deal (terms undisclosed) where Getty’s library was used to train a custom generative model offered to Getty’s clients.
Industry: Stock media. Data: Professional photos and illustrations from Getty.
Getty Images–Shutterstock Merger (Jan 2025): As of January 2025, Getty Images and Shutterstock announced an agreement to merge into a single company. This consolidation, uniting two of the largest commercial image libraries, may affect future data licensing strategies since the combined entity controls a huge portion of the market’s premium imagery. (Regulatory approval of the merger was pending as of June 2025, although Shutterstock received stockholder approval on June 10, 2025.)
Freepik/EyeEm and Unnamed Tech Giants (2023): Freepik, which had acquired the photo platform EyeEm, licensed the majority of its 200 million image archive to two large tech companies in 2023. The pricing was stated at $0.02–$0.04 per image, implying each deal is worth on the order of $4–8 million. At least two such deals (roughly $6M each) were signed, and Freepik’s CEO noted five more similar deals were in the pipeline. The buyers were not identified, but presumably are companies developing image-generative AI. Licenses are likely non-exclusive bulk data sales.
Industry: Stock photos. Data: Photographic images (stock and user-contributed).
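The per-image arithmetic behind the Freepik estimate can be sketched in a few lines of Python. This assumes the full 200 million image archive is licensed in each deal, whereas the reporting only says "the majority", so the results are upper bounds. The function name is illustrative.

```python
def implied_deal_value(num_items: int, price_low: float, price_high: float) -> tuple[float, float]:
    """Return the (low, high) implied value in USD of a flat per-item bulk license."""
    return num_items * price_low, num_items * price_high


# Figures quoted in the text. Licensing the whole archive is an assumption;
# the text says only that the majority of it was licensed.
FREEPIK_IMAGES = 200_000_000
low, high = implied_deal_value(FREEPIK_IMAGES, 0.02, 0.04)
print(f"Implied range per deal: ${low / 1e6:.0f}M to ${high / 1e6:.0f}M")
```

The reported figure of roughly $6M per deal sits at the midpoint of this range.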
Photobucket (in negotiations, 2024): Photobucket, an image hosting site, revealed in 2024 that it was in talks with multiple tech companies to license its enormous library of 13 billion user images and videos. CEO Ted Leonard discussed proposed rates between $0.05 and $1.00 per photo (and >$1 per video), indicating that even at the low end, a full library deal could exceed $650 million. More likely, deals for subsets or specific volumes were being considered. As of mid-2024, no specific deal had been publicly closed. Photobucket’s archive, significantly larger than Shutterstock’s, is highly appealing for foundational vision model training – though it may be hampered by the need to filter and annotate user-generated content at that scale. (Update: As of late 2024, Photobucket’s licensing plans prompted legal challenges – a class-action lawsuit filed in Dec 2024 alleges the company sought to sell users’ photos for AI without proper consent. The lawsuit aims to block Photobucket from proceeding with any data sale until privacy and permission issues are resolved. No licensing deal has been executed to date.)
Industry: Photo storage/hosting. Data: User-uploaded photos and videos.
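The Photobucket numbers show why a full-library deal is unlikely. The Python sketch below reproduces the low-end estimate and then asks what share of the library a buyer could license for a given budget. It prices every item at the per-photo rate (videos were quoted higher, so the full-library figure is a floor), and the $50 million budget is a purely hypothetical value chosen for illustration, not a figure from Photobucket's reported talks.

```python
LIBRARY_ITEMS = 13_000_000_000  # images and videos, per the text
PRICE_LOW = 0.05                # USD per photo, low end of the quoted range
PRICE_HIGH = 1.00               # USD per photo, high end of the quoted range
HYPOTHETICAL_BUDGET = 50_000_000  # illustrative assumption, not from the source


def full_library_value(items: int, price: float) -> float:
    """Value of licensing every item at a flat per-item price."""
    return items * price


def share_for_budget(budget: float, price: float, items: int) -> float:
    """Fraction of the library a budget covers at a flat per-item price."""
    return min(1.0, budget / price / items)


floor_value = full_library_value(LIBRARY_ITEMS, PRICE_LOW)
ceiling_value = full_library_value(LIBRARY_ITEMS, PRICE_HIGH)
share = share_for_budget(HYPOTHETICAL_BUDGET, PRICE_LOW, LIBRARY_ITEMS)

print(f"Full library: ${floor_value / 1e6:.0f}M to ${ceiling_value / 1e9:.0f}B")
print(f"A ${HYPOTHETICAL_BUDGET / 1e6:.0f}M budget covers {share:.1%} of the library")
```

Even at the lowest quoted rate, a budget in the tens of millions covers only a single-digit percentage of the archive, which fits the expectation that any deals would be for subsets.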
Adobe and NVIDIA (2023): Adobe, instead of licensing its content to external AI developers, trained its own Firefly generative AI on Adobe’s stock image library (with contributors’ permission). In 2023, Adobe partnered with NVIDIA to integrate Firefly into NVIDIA’s Picasso cloud AI service. This partnership can be seen as Adobe effectively licensing its curated dataset (Adobe Stock images and other Adobe-owned assets) via a collaboration with NVIDIA. Terms weren’t disclosed, as it’s more of a product integration than a raw data sale.
Industry: Stock media / Creative software. Data: Stock images (licensed from Adobe’s contributors).
Adobe and Black Forest Labs, Runway, Google Cloud, and Fal (2025): A notable recent partnership bridging content creators and AI labs is Adobe’s deal with Black Forest Labs (BFL) and others. Black Forest Labs is a 2024-founded generative AI startup led by researchers who pioneered Stable Diffusion and related models. In 2025, Adobe announced it would integrate BFL’s “Flux” image generation models directly into Adobe’s Creative Cloud ecosystem. This partnership does not involve licensing Adobe’s content to BFL, but rather the opposite: Adobe is licensing BFL’s AI model as a third-party offering within its products. Under the arrangement, Black Forest Labs gets distribution access to Adobe’s massive user base, while Adobe expands its AI toolset beyond its in-house Firefly models. The nature of the deal is a strategic partnership/integration: BFL’s Flux 1.1 Pro and others (including Runway's "Frames" model) are offered in Adobe apps (e.g. Project Concept, Adobe Express) alongside Adobe’s own models. This gives creators a choice of generative styles, and Adobe ensures any content generated with these models in its tools is clearly labeled and not used for further training without consent. This “open ecosystem” approach reflects a broader trend of media companies and AI startups forming symbiotic relationships: rather than just exchanging data, they are combining strengths (content, user experience on one side; algorithmic innovation on the other).
DataSeedsAI (Zedge) (2025): Zedge announced in June 2025 a strategic expansion into the AI content licensing market through the launch of DataSeeds.AI, a B2B marketplace for AI training datasets. The platform leverages Zedge's extensive creator community and catalog of over 30 million images to provide high-quality, rights-cleared content for AI model training. DataSeeds.AI has already secured its first partnership with an unnamed leading AI company, validating its market potential. The platform aims to serve enterprises needing large-scale, diverse datasets while creating new revenue streams for Zedge's content creators.
Symphonic Distribution and Musical AI (Aug 2024): In the music sector, Symphonic Distribution (an independent music distributor) partnered with a startup called Musical AI to license its entire music catalog for AI training. Through this opt-in program, Symphonic’s indie artists and labels can agree to have their songs used to train generative AI music models in exchange for compensation. This deal marked one of the first licensed music datasets for AI, aiming to be a “properly-licensed” alternative to the rampant unauthorized scraping of music. Musical AI (which raised seed funding in 2023) will train models on the opted-in tracks to create AI-generated music. Financial details weren’t public; the structure is likely a revenue share or per-track royalty paid to rights holders.
Industry: Music. Data: Audio recordings (songs).
Universal Music Group and Google (exploratory, 2023–2024): In mid-2023, UMG and Google confirmed they were exploring licensing agreements for artists’ voices and music to train AI that can generate “deepfake” songs legally. As of 2025, no finalized deal had been announced, but the talks implied a framework where major artists (and their labels) would grant permission for their vocal likenesses to be used in AI-generated music, in exchange for compensation. This highlights a growing trend toward licensing in the music industry to preempt copyright and publicity-rights issues, although concrete agreements were still in development.
Industry: Music. Data: Audio recordings (songs).
Tumblr, WordPress and OpenAI/Midjourney: Tumblr and WordPress made headlines in early 2024 for pursuing content deals. Leaked internal documents showed Tumblr preparing to sell vast amounts of user data to OpenAI and Midjourney. The exact data types weren’t fully detailed, but the data likely includes blog posts and images from public Tumblr blogs (2014–2023) for use in training AI image and text models. Notably, an engineer’s memo indicated Tumblr accidentally compiled essentially all public (and some private) posts in a data dump for the AI firms, highlighting the enormity of data on the table. The deal was described as “imminent” as of Feb 2024. Tumblr planned to introduce an opt-out for users, but the leak suggested much data had already been packaged for sale.
Editor's Note: Over the past three years, visual media and creative content deals have transitioned from covert scraping to a wave of formal licensing arrangements between content platforms and AI developers. The available deal data points to a rapidly maturing market where high-quality, rights-cleared assets—ranging from premium stock libraries to massive archives of user-generated content—are becoming core infrastructure for model development.
Platforms like Shutterstock, Zedge, and Adobe now act not just as content sources but as strategic partners, offering full-stack solutions that combine datasets, distribution channels, and model integrations. Meanwhile, legacy players such as Photobucket and Freepik are repositioning themselves as data brokers, aiming to monetize dormant libraries through AI licensing deals.
Yet this shift has not been without friction. Getty Images and DeviantArt have taken legal action in response to unauthorized model training. Others, including Tumblr and Photobucket, are under growing scrutiny for how they structure user consent and manage privacy concerns in their data monetization efforts.
Transparency remains the market’s most glaring shortfall. Although some price points have surfaced—ranging from pennies per image to tens of millions per license—most agreements are locked behind NDAs or buried within vague revenue line items. This lack of clarity makes it difficult for stakeholders to assess the true value of these datasets or understand how and where they are being deployed.
As foundational models mature and the AI race shifts from scale to specificity, content owners will hold greater leverage. Licensing terms will tighten. Prices will rise. And what began as a gray market for scraped content will continue its evolution into a structured, rights-aware economy for training data. This section captures that turning point.
Academic and Book Publisher Deals
Taylor & Francis (Informa) and Microsoft (July 2023): Academic publisher Taylor & Francis (part of Informa PLC) revealed it made a deal to sell access to its research publications for AI training. One confirmed partnership was with Microsoft in 2023, reportedly worth $10 million. Informa’s mid-2023 financial report later indicated that T&F was expected to earn £58M (~$75M) in 2023 from licensing content to AI firms. This suggests multiple deals: the $10M Microsoft deal, plus at least one other major AI customer (undisclosed), and additional pipeline deals. These agreements likely grant non-exclusive access to large swaths of academic journal articles and ebooks for training LLMs, with provisions to protect authors’ rights (e.g. no verbatim reproduction in AI outputs).
Industry: Academic publishing. Data: Scholarly journals and textbooks.
Wiley – multiple AI companies (2023–2024): John Wiley & Sons confirmed it entered at least two agreements to license its published content for AI training. By early 2024, Wiley anticipated $44 million in revenue from these AI licensing deals. (The company did not name the partners, but industry observers speculated they were large tech or AI firms. Wiley emphasized that author compensation and copyright protection were part of these arrangements.) In a public statement, Wiley condemned illegal scraping and positioned its licensed, structured content as a high-quality alternative.
Industry: Academic & professional publishing. Data: Research articles, academic books, databases.
Oxford University Press – “AI partnerships” (2024): OUP acknowledged in mid-2024 that it is “actively working with companies developing LLMs” to license content for the responsible development of AI. No specific deal was named, but OUP’s efforts likely mirror its peers. (Oxford’s vast catalog, including academic journals and the Oxford English Dictionary, is presumably in scope for these discussions.)
Industry: Academic publishing. Data: Journals, reference works, dictionaries.
Cambridge University Press – opt-in initiative (2024): Cambridge UP had not publicly confirmed a signed deal as of 2024, but it announced an opt-in system for authors regarding AI uses of their work. This indicates CUP is preparing or negotiating licensing arrangements and wants to secure author permissions proactively. At the time, the press reached out to ~20,000 authors for consent to include their works in potential AI training licenses, noting that only a “small minority” declined to participate. This effort suggests CUP is ironing out “fair remuneration” models for contributors and aims to ensure author goodwill as it pursues deals.
Industry: Academic publishing. Data: Academic journals and books (opt-in content).
Elsevier – exploratory talks (2023–2024): Elsevier (RELX Group) was rumored to be in talks with AI firms given its valuable scientific content. As of 2024, no specific deals have been disclosed. (RELX’s LexisNexis division, however, is known to license legal text data for AI – for instance, to startups training legal AI models – but those deals are typically private and beyond scholarly publishing.)
HarperCollins – “large tech company” (2024): HarperCollins, a major book publisher, made a deal with an unnamed AI technology company in 2024 to license a selection of its books for training AI models. Authors were given the choice to opt in their works to this program or decline. Leaked details from literary agents indicated HarperCollins would pay authors $2,500 per book title included in the training dataset. The AI partner was not officially named; industry insiders speculated it could be OpenAI or possibly Google. The contract length and usage terms were not public, but presumably the AI firm gets a non-exclusive license to the full text of those books for model training, and participating authors receive a one-time fee (in addition to any royalties).
Industry: Book publishing. Data: Full-text books (opt-in titles).
American Association for the Advancement of Science (AAAS) and ProRata (Jan 2025): AAAS (publisher of the Science family of journals) launched a pilot licensing deal with the ProRata platform in early 2025. The pilot allows select content – peer-reviewed articles from the open-access Science Advances journal and news stories from Science magazine’s team – to be used in ProRata’s AI-driven search engine Gist. Unlike broad LLM training, this use case focuses on enhancing search and transparency for scientific information. AAAS’s deal emphasizes attribution and high-quality, current content rather than mass data ingestion. Terms were not detailed publicly, but in 2024 ProRata was offering a 50/50 revenue split to content partners based on usage.
Industry: Scientific publishing. Data: Research articles and science news.
NEJM Group and OpenEvidence (Feb 2025): The publisher of the New England Journal of Medicine (Massachusetts Medical Society) announced a multi-year licensing agreement with OpenEvidence in 2025. OpenEvidence is a platform that provides clinicians with up-to-date medical research via an AI-assisted evidence search tool. Under the deal, OpenEvidence’s retrieval-augmented generation (RAG) system can use NEJM Group content (NEJM and related medical journals, dating back to 1990) to inform its answers for doctors. Financial terms were not disclosed.
Industry: Medical publishing. Data: Medical journal articles (clinical research, reviews).
Wiley and Perplexity AI (May 2025): Wiley announced a partnership with Perplexity AI to integrate Wiley’s authoritative academic content into Perplexity’s generative AI search tool. Wiley became Perplexity’s first education-sector content partner. The deal allows users at institutions that subscribe to Wiley’s digital libraries (in fields like nursing, business, engineering, etc.) to use Perplexity’s AI assistant to query Wiley materials with proper citations. In practice, students and researchers using Perplexity Enterprise gain new “AI search” pathways into the scholarly content they already have access to, combining AI’s Q&A capabilities with verified Wiley sources. This partnership is less about a bulk data sale and more about recurring content integration – Wiley maintains its subscription revenue (since only paid collections are accessible) while enhancing the value of its content with AI functionality. Financial details were not disclosed, but by mid-2025 Wiley reported over $50M in cumulative AI licensing revenue from its various deals (Bloomberg).
Industry: Academic publishing / EdTech. Data: Educational collections and e-textbooks (licensed for AI search with attribution).
Wiley and Amazon Web Services (May 2025): Wiley also collaborated with AWS to launch a generative AI literature search agent for life sciences, unveiled at AWS’s May 2025 Life Sciences Symposium. The agent, built on AWS’s Bedrock platform, enables researchers to perform full-text searches across Wiley’s extensive journal content (not just abstracts) using AI. This is the first AI agent of its kind from a publisher on AWS. Essentially, AWS customers in healthcare and biotech can query Wiley’s corpus (e.g. clinical trial methods, results sections) via an AI assistant, bringing more detailed scientific content into AI-driven workflows. Financial terms were not disclosed; Wiley’s Q1 2025 earnings call described its AI licensing strategy in two stages: first, one-time deals for training major models; second, ongoing licensing embedded in products like this AWS agent as AI applications roll out.
Industry: Academic publishing / Cloud AI. Data: Full-text scientific journal articles.
Editor's Note: Smaller academic publishers are also moving quietly into AI deals. For example, Taylor & Francis’s parent Informa has licensed not only journals but also corporate report content; Springer Nature and IEEE are likewise exploring AI partnerships. By 2025, the trend is clear: publishers can profit handsomely by monetizing their archives for AI training, and many have struck deals (often quietly) – sometimes to the indignation of authors and editors who raise ethical concerns. The balance between protecting intellectual property and enabling innovation is still being negotiated across the industry.
Other Enterprise Data Licensing Deals
Bloomberg – internal AI (2022–2023): Bloomberg LP, instead of licensing its data externally, used its proprietary troves of financial news and market data to train BloombergGPT, a 50-billion-parameter finance-focused language model launched in 2023. This was an internal “deal” – Bloomberg as both data owner and model builder – so no external contract or revenue exchange occurred. However, it underscores the value of exclusive data: Bloomberg’s approach (building its own LLM on its unique dataset) contrasts with publishers that chose to sell access to others.
Industry: Financial information. Data: Financial news, stock data, proprietary datasets (used in-house).
IBM – multiple AI firms (2023–2024): IBM possesses vast enterprise datasets (e.g. The Weather Company’s historical weather data, IBM’s code repositories, customer support logs, etc.). In its Q2 2024 earnings call, IBM disclosed multi-year licensing agreements to supply certain enterprise data to several AI companies, which significantly boosted its “data and AI” revenue (IBM 2Q 2024 Earnings). Details are scant due to confidentiality, but these likely include licensing proprietary datasets – for example, weather data, industry research, or software knowledge bases – to model developers. IBM’s CEO mentioned the company would pursue deals to provide data in domains like customer service and IT operations to AI startups.
Industry: IT & enterprise analytics. Data: Various (e.g. weather records, technical documentation, business process data).
C3.ai – industry data partnerships (2024): C3.ai, an enterprise AI software provider, reported new data licensing deals in its FY2024 Q3 earnings. These deals contributed to 18% YoY revenue growth. C3.ai works with industries like energy, defense, and manufacturing; one known partner is (providing oil industry data in exchange for equity and royalties in past arrangements).
Industry: Enterprise AI software. Data: Industrial IoT and enterprise operations data.
NVIDIA – “NeMo” toolkit data (2023): NVIDIA, while known for hardware, also curates datasets for its NeMo™ large language model services. In late 2023, NVIDIA reported a boost in data center revenue partly due to licensing agreements for its “AI-ready” training data via the NVIDIA NeMo Retriever product. This indicates NVIDIA packaged certain valuable datasets (perhaps code, or synthetic data, or scraped and cleaned text corpora) and licensed them to enterprise customers building models, contributing to record data center sales. (These could be bundled deals of NVIDIA hardware + data.) Notably, subsequent NVIDIA reports did not break out this data-licensing contribution, and the company has seemed less inclined in recent months to make NeMo into a consumer product, focusing instead on its core GPU business.
Industry: AI infrastructure. Data: Curated large-scale text and image datasets (proprietary or assembled by NVIDIA).
Planet Labs – multiple clients (2020–2024): Satellite imagery company Planet Labs generates daily Earth images. Its business model is data licensing via subscriptions and API access. By 2024, Planet emphasized (through its 2024 10-Q) delivering “AI-ready” satellite datasets, leveraging its archive for machine learning applications. Clients include agriculture, government, and climate AI projects. While not a single deal, Planet’s revenue (over $180M in 2024) largely comes from multi-year data licenses to companies (some likely training geospatial AI on this data).
Industry: Satellite imagery. Data: Earth observation images (high-cadence satellite photos).
SoundHound AI and (Unnamed) Semiconductor Company (2023): Voice AI firm SoundHound disclosed a one-time voice data licensing deal with a U.S. semiconductor company in 2023, yielding $3.6 million in revenue, per its 2023–24 annual report. The deal provided SoundHound’s extensive voice/audio dataset (speech recordings, voice commands, etc.) to the chipmaker, likely to improve on-chip AI models for speech recognition. It was non-recurring and characterized as a licensing of voice data for AI, completed in FY2023. The unnamed chip company could be Qualcomm, NVIDIA, AMD, or Intel, all of which develop AI speech capabilities.
Industry: Voice AI / Semiconductors. Data: Spoken audio recordings (voice commands, queries).
Tempus – multiple deals (2024): Tempus (NASDAQ: TEM) is a precision medicine tech company with a large library of clinical and molecular data. In its August 2024 8-K, it reported 40% YoY growth in data licensing revenue, reaching $3.7M in the first half of 2024. This implies deals where Tempus licenses anonymized patient data, genomic sequences, or clinical records to pharmaceutical or AI companies for model training (e.g. AI drug discovery or medical LLMs). Tempus’s data licensing is a growing revenue stream within its overall 32% revenue growth.
Industry: Healthcare/Biotech. Data: Genomic sequences and clinical patient data (de-identified).
Syndigate – multiple AI companies (2023): Syndigate, a content syndication service aggregating news from the Middle East and globally, began licensing its multilingual news feeds to AI model developers around late 2023. Likely clients are AI firms needing non-English news data to diversify their training corpora.
Industry: News syndication. Data: Multilingual news articles and wire services.
Others: Numerous other data licensing arrangements have been reported across sectors. For example: Wolters Kluwer (a legal and tax information provider) has licensed parts of its content for training legal AI; Relativity (legal e-discovery software) has licensed document corpora to AI startups working on legal reasoning; Experian (a credit bureau) began offering consumer credit data for AI-driven credit scoring models; Meta has paid data brokers for certain niche datasets to supplement publicly available data like Common Crawl; and Figure AI – a humanoid robotics startup – even inked a collaboration with OpenAI in 2024 (to use OpenAI’s language models for its robots’ cognition) which was later terminated in early 2025. Each of these examples reflects the nascent market for high-quality, proprietary data to feed AI systems.
Editor's Note: The deals above range from direct content fees (lump-sum or annual payments) to partnerships with revenue-sharing or technology exchange. Contract lengths are typically multi-year (3–6 years is common) and can be exclusive or non-exclusive. Usage restrictions are often included – for instance, ensuring attribution in AI outputs or preventing the AI from republishing licensed text verbatim without credit. The values of these deals vary widely, from low millions per year for niche archives up to hundreds of millions for major media conglomerates or large image libraries. What is clear is that data has become a critical bargaining chip in the AI economy. Even as legal debates continue over fair use and scraping, many companies are choosing to strike deals and monetize their data assets. The result is a rapidly evolving marketplace for AI training data, one that is defining new norms for how information is valued and exchanged in the generative AI era.