
Amid the rapidly evolving legal landscape surrounding artificial intelligence, we highlight a landmark settlement proposal that, despite being halted by a federal court, reveals critical new standards for AI training data. These developments may impact how your company procures, uses, or develops AI models.
A QUICK RECAP – THE AI TRAINING DATA LITIGATION LANDSCAPE
Over the past two years, leading AI companies—including OpenAI, Microsoft, Meta, Stability AI, and Anthropic—have faced a wave of lawsuits from copyright owners, authors, artists, and publishers alleging unauthorized use of copyrighted material to train large language models. The New York Times’ lawsuit against OpenAI and Microsoft seeks billions in damages and the destruction of ChatGPT’s dataset. Meanwhile, authors like Sarah Silverman and Ta-Nehisi Coates have filed similar claims against Meta and others, alleging they “simply stole” copyrighted works to build their billion-dollar AI systems.
A pivotal case in this landscape, Bartz v. Anthropic, was filed in August 2024 in the Northern District of California. Plaintiffs alleged that Anthropic used pirated e-books from shadow libraries such as LibGen to build “central library” datasets and train its “Claude” large language models, raising core questions about how copyright doctrines like fair use apply in the Gen-AI context.
In June 2025, Judge William Alsup issued a landmark summary judgment, ruling that training on lawfully acquired books was “quintessentially transformative” and protected under fair use. However, the court found that downloading, retaining, and using pirated copies for Anthropic’s central library did not qualify as fair use and constituted infringement. In mid-July 2025, the court certified a class action of U.S. copyright holders whose works were downloaded from shadow libraries, setting the stage for exposure that could run into hundreds of billions of dollars in damages.
The June 2025 ruling thus imposes liability for sourcing and retaining pirated protected content even where fair use arguments for the training itself succeed. This distinction highlights the growing importance of data provenance and the legal risks associated with sourcing, not just using, data.
A LANDMARK SETTLEMENT, ALTHOUGH HALTED BY THE COURT, GIVES A GLIMPSE OF THE FUTURE [SEPTEMBER 2025]
Facing massive potential liability, Anthropic agreed in September 2025 to a proposed $1.5 billion class-action settlement, the largest copyright settlement in U.S. history. However, Judge Alsup denied preliminary approval of the settlement, saying he felt “misled” by the deal and that the proposal was “nowhere close to complete” and risked being forced “down the throat of authors.”
Specifically, Judge Alsup noted that the proposed settlement fails to ensure compensation for authors: it lacks definitive lists of eligible pirated books and authors, and its undefined mechanisms for notification and claims submission could prevent effective compensation of the entire class. As the court noted, such a settlement would also fail to provide Anthropic with legal finality, leaving it exposed to future lawsuits over the same infringements.
However, although halted, the proposed settlement underscores a key trend: the market is not waiting for courts or regulators. Companies are using commercial negotiations to bypass the legally fascinating but commercially uncertain questions of copyright infringement and fair use. In other words, both sides of the market, copyright owners and LLM developers alike, prefer to “buy” risk and gain certainty rather than wait for regulators and judges to define their commercial future.
This mirrors historical patterns from previous technological inflection points, where legal uncertainty was ultimately resolved through a combination of exceptions and licensing deals rather than litigation alone[1]. Anthropic’s agreement to settle, probably funded by the $13 billion Series F round completed just three days before the announcement, is the strongest signal yet of an emerging commercial standard: putting a price on training AI on protected content through a mix of opt-out regimes, licensing frameworks, and negotiated exceptions rather than litigation strategies.
KEY TAKEAWAYS FOR YOUR BUSINESS
Use of protected content in AI training is becoming a clearly monetized practice. Governing how data is used for training, and documenting data provenance, are becoming market standards that limit companies’ exposure in high-stakes litigation. For your business, readiness is key:
- Data Provenance as a Top Compliance Priority: Verifying the source and legality of training data, where practically feasible and cost-effective, will be crucial both for managing future exposure when training new models and for due diligence when procuring third-party AI tools and services.
- Essential Contractual Protections: Training-data warranties, audit trails, indemnities, and model retraining covenants will become standard contractual requirements.
- “Tainted Ownership” Theoretical Risks: It is not yet clear how courts will rule in cases where models were trained on “non-fair-use” datasets, or how that would affect users’ ownership of outputs down the supply chain. However, major AI providers have already started inserting IP “shield” clauses into commercial contracts and terms of use, preventing some IP risks from being passed on to user-generated outputs downstream. In this context, Bartz v. Anthropic may set a commercial standard for commoditizing risk allocation, supplemented by strong assurances such as data deletion and model retraining in case of IP breaches.
GLOBAL & ISRAELI CONSIDERATIONS
While the Anthropic settlement is a U.S. case, its influence will likely extend internationally, as AI operations are highly interconnected and shaped by cross-border dynamics. For instance, Getty Images recently dropped its primary copyright claims against Stability AI in the UK, citing jurisdictional challenges because the training occurred on U.S. servers, a reminder that plaintiffs across the ocean are closely following the high-stakes proceedings in the U.S.
In Israel, a 2022 Ministry of Justice opinion permitted AI training on protected content under the “fair use” defense. But a 2024 inter-ministerial interim report for the financial sector shows a broader focus on how training data is collected. The report flags data scraping as a potential violation of privacy laws, a position also expressed in the recent opinion published by the Privacy Protection Authority on the application of privacy law to AI (currently a draft open to public comments and consultation). This mirrors the U.S. trend, where the source of data has become as crucial an issue as how it is used. Data provenance and legality have become a clear regulatory expectation in the EU as well, where the AI Office recently published a template for the summaries of copyrighted content used for training that developers of general-purpose AI models are expected to disclose, in line with the GPAI Code of Practice and the EU AI Act.
The market and more regulators will most likely react, and Israeli companies, especially those with U.S. operations, will increasingly be expected to devise content and data strategies aligned with emerging commercial standards and sector-specific regulations. Documenting training-data provenance and legality, or mitigating risks contractually, is becoming essential for any company deploying or developing AI technologies and tools.
LOOKING FORWARD
The proposed settlement in Bartz v. Anthropic transforms abstract legal risks into concrete commercial realities, marking a pivotal moment where the focus shifts from theoretical fair use debates to the practical and commercial consequences of improperly sourced data.
It may also signal a broader shift in the regulatory cycle concerning AI copyright issues.
After a first wave of litigation focused on provenance and how training data is used, the market and the courts will have to confront the next set of legal questions: when and how AI-generated outputs infringe third-party rights (for example, the viral Studio Ghibli case), and the highly complex question of AI developers’ liability as intermediaries in such cases.
Navigating the evolving legal, regulatory, and commercial landscape is the basis for managing AI risks and enhancing operational stability. Our AI and Tech Regulation team is available to assist with reviewing AI model training and data exposure; reviewing third-party AI tools, platforms, and data vendor agreements; conducting due diligence on AI tools; drafting and updating internal AI governance policies; and running training and literacy programs.
[1] The VCR, initially seen as an existential threat to Hollywood, became its financial lifeline—by 1995, home video generated more than half of Hollywood’s revenue. YouTube’s Content ID transformed potential infringement into billions in payments by letting rights holders monetize rather than block user content.