Federal Court Orders OpenAI to Produce Over 100 Million ChatGPT Logs as Copyright Discovery Battle Escalates

Overview

The consolidated copyright case against OpenAI in Manhattan federal court has produced a series of discovery rulings that could reshape how AI companies defend their use of copyrighted training data. Since November 2025, U.S. District Judge Sidney Stein and Magistrate Judge Ona Wang have ordered OpenAI to produce more than 100 million de-identified ChatGPT conversation logs, disclose documents related to deleted datasets of pirated books, and make in-house lawyers available for deposition.

The case, styled In re: OpenAI, Inc. Copyright Infringement Litigation (No. 1:25-md-03143, S.D.N.Y.), consolidates 16 separate copyright lawsuits from news organizations, authors, and publishers into a single multidistrict litigation proceeding, as documented on CourtListener.

The Log Production Orders

The discovery dispute came to a head in November 2025, when Magistrate Judge Wang ordered OpenAI to produce 20 million de-identified ChatGPT output logs. OpenAI had itself proposed the 20-million figure as a compromise: plaintiffs originally sought 120 million logs from a preserved pool of tens of billions of conversations. In October 2025, however, after agreeing to the sample size, OpenAI reversed course and asked to restrict production to conversations matching keyword searches for specific plaintiffs’ copyrighted works.

Judge Stein affirmed the full production order on January 5, 2026, rejecting OpenAI’s attempt to filter the sample. The court found that even logs not directly reproducing plaintiffs’ works are discoverable because they bear on OpenAI’s fair use defense. Logs showing what ChatGPT produces across a broad range of queries could reveal patterns relevant to whether the model’s outputs compete with or substitute for copyrighted content.

On the question of user privacy, Judge Stein drew a sharp distinction between covert surveillance and voluntary disclosure to a commercial service: privacy interests in conversations that users voluntarily submitted to OpenAI are fundamentally different from those implicated by illegal wiretaps. The court deemed three safeguards sufficient: reducing the sample from tens of billions of conversations to 20 million, de-identifying personally identifiable information, and keeping the logs under an existing protective order.

On March 9, 2026, the court expanded discovery further, granting plaintiffs’ motion to compel production of two additional reservoirs of 78 million and 10 million logs. Added to the original 20 million, that brings the total ordered production to more than 108 million conversation records.

Deleted Training Data and Privilege Waiver

A parallel discovery battle has focused on datasets known internally as Books1 and Books2, which OpenAI used to train earlier models. The datasets were sourced from Library Genesis and other shadow libraries containing pirated copies of copyrighted books. OpenAI deleted the datasets during the litigation, prompting plaintiffs to seek documents explaining the company’s motivations.

In November 2025, Magistrate Judge Wang ruled that OpenAI had waived attorney-client privilege over communications about the deletion. The court found that OpenAI gave inconsistent explanations — first citing “non-use” as the reason for deletion, then claiming all reasons were privileged. The ruling required OpenAI to produce the relevant documents by December 8, 2025, and to make in-house lawyers available for deposition by December 19.

Motion to Dismiss Denied

The discovery escalation came after Judge Stein denied OpenAI’s motion to dismiss in October 2025. The court held that plaintiffs had plausibly alleged that certain ChatGPT outputs are substantially similar to copyrighted works. In one cited example, ChatGPT summaries of George R.R. Martin novels were found sufficiently similar to the originals for the infringement claim to survive dismissal and move forward.

The court was careful to note that its ruling “is not intended to suggest a view on whether the allegedly infringing outputs are protected as fair uses,” leaving the central legal question for later proceedings.

A Broader Litigation Wave

The OpenAI MDL is the largest but far from the only front in the copyright battle over AI training. Nearly 100 copyright lawsuits have been filed against AI companies as of early April 2026, with at least ten new complaints arriving in March alone. Plaintiffs range from individual authors and visual artists to major institutions. Encyclopedia Britannica and Merriam-Webster sued OpenAI on March 18, 2026, alleging that ChatGPT reproduces their curated definitions and encyclopedia entries verbatim, “starving web publishers of revenue” by delivering polished answers instead of directing users to original sources.

Disney and Universal have a pending lawsuit against Midjourney over the alleged use of copyrighted character images to train its image-generation model, with a post-mediation status conference scheduled for August 31, 2026, and a trial expected in late 2026.

The litigation landscape is not uniformly hostile to AI companies. In Kadrey v. Meta, a federal judge granted summary judgment to Meta, holding that training its Llama models on copyrighted books constituted fair use. That ruling, however, turned on the plaintiffs’ failure to develop sufficient evidence of market harm, not on an endorsement of AI training practices generally.

The Anthropic Precedent

The largest resolution to date came outside the courtroom. In September 2025, Anthropic agreed to pay $1.5 billion to settle the Bartz v. Anthropic class action, the largest publicly reported copyright settlement in history, according to NPR. The settlement covers approximately 500,000 books that Anthropic downloaded from pirate sites to train its Claude models, working out to roughly $3,000 per work. Anthropic also agreed to destroy the original pirated files.

A final approval hearing for the settlement is scheduled for April 23, 2026, in San Francisco. The settlement’s per-work valuation has become a reference point for other plaintiffs negotiating with AI companies.

Legislative Crosscurrents

Congress has begun to weigh in. Senator Marsha Blackburn released a discussion draft of the TRUMP AMERICA AI Act in March 2026, which explicitly states that unauthorized use of copyrighted works for training AI “shall not constitute fair use.” The White House took a different approach in its National AI Legislative Framework, acknowledging differing views on fair use and recommending that courts resolve the question through litigation while encouraging market-led licensing arrangements.

With Congress and the White House pulling in different directions, the OpenAI MDL and similar cases will likely define the boundaries of AI copyright law long before any legislation passes. With more than 108 million ChatGPT conversation logs now headed to plaintiffs’ experts for analysis, the fair use defense that underpins the entire generative AI industry faces its most rigorous examination yet.