The New York Times Sues OpenAI and Microsoft

Written by Mike Kaput | Jan 9, 2024

The New York Times has sued OpenAI and Microsoft for copyright infringement. It’s a landmark legal battle that will have huge implications for AI and media moving forward.

The lawsuit claims the two companies trained their AI models on millions of the paper’s copyrighted articles. It doesn’t specify a damages figure, but it says they should pay “billions.” It also calls on them to destroy models and training data built on the copyrighted material.

“I think this one is a really big deal,” says Marketing AI Institute founder/CEO Paul Roetzer.

On Episode 78 of The Marketing AI Show, he unpacked for me what’s going on and why it all matters.

Why Is The New York Times Suing OpenAI?

This matters for a couple of reasons, says Roetzer.

One, it’s The New York Times bringing the lawsuit. Two, the case appears to be well argued.

To understand why, it helps to take a step back and understand why this is happening—and what’s at stake.

All AI models are trained on data. ChatGPT, for example, is powered by GPT-4, a model trained on enormous amounts of text. That’s how it’s able to answer questions, write articles, draft emails, and more: it has learned from a vast body of examples.

The higher the quality of the data, the better the model performs. So if you want a model like GPT-4 to write well, you want it to learn from the best-written content and the greatest depth of knowledge available.

For that, you need legitimate sources, not random Reddit comments and threads on X.
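
To make that concrete, here’s a toy sketch in Python. It’s nothing like GPT-4’s actual training process, but it shows the core idea: a language model can only reproduce patterns that exist in its training data, so the quality of the data sets the ceiling on the quality of the model.

```python
# Toy illustration only: a word-level bigram "language model" that learns
# nothing beyond whatever text you feed it.
import random
from collections import defaultdict

def train_bigram_model(text):
    """Record which word follows which -- the simplest possible language model."""
    model = defaultdict(list)
    words = text.split()
    for current_word, next_word in zip(words, words[1:]):
        model[current_word].append(next_word)
    return model

def generate(model, start, length=10):
    """Generate text by sampling from continuations seen during training."""
    output = [start]
    for _ in range(length):
        continuations = model.get(output[-1])
        if not continuations:
            break
        output.append(random.choice(continuations))
    return " ".join(output)

# The model can only echo its training data -- feed it thin or low-quality
# text, and that is all it will ever produce.
corpus = "the times sued openai . openai trained its model on the times"
model = train_bigram_model(corpus)
print(generate(model, "openai"))
```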

Google, Meta, Amazon, and some others already have quality data they can train their models on. This data comes from their proprietary search, social, and ecommerce networks.

But companies like OpenAI don’t have their own data to train their models on. They have to train their models on other people’s data.

Hence the lawsuit. The Times alleges OpenAI didn’t have permission or the legal right to train GPT-4 on its content.

And given the simple reality outlined above, quality data or no data, AI companies knew what they were getting into.

“These AI companies certainly knew going in that it was a gray area that was likely going to be challenged legally,” says Roetzer.

How Strong Is The New York Times’ Lawsuit?

It appears The Times has a solid case.

Now, we’re not attorneys. But very smart people like Cecilia Ziniti are. She specializes in tech law and previously served as general counsel at major AI player Replit. And she broke down some important points in a recent X thread.

First, she says the lawsuit is very clear in its claim of copyright infringement. OpenAI used Common Crawl, a repository of crawled websites, to train GPT-4. The lawsuit shows that The Times is the largest proprietary source of content in Common Crawl, behind only Wikipedia and a database of U.S. patent documents.
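
You can see how accessible this data is for yourself. Here’s a rough sketch (assuming Common Crawl’s public CDX index API, which anyone can query; the collection ID below is one example and is superseded by newer crawls) of looking up nytimes.com pages captured in a Common Crawl snapshot:

```python
# Hypothetical example: querying Common Crawl's public CDX index for
# nytimes.com captures. Collection IDs follow the CC-MAIN-<year>-<week>
# pattern; swap in a current one from commoncrawl.org.
import json
import urllib.request

INDEX = "https://index.commoncrawl.org/CC-MAIN-2023-50-index"
query = "?url=nytimes.com/*&output=json&limit=5"

# Identify the client with a user agent, as Common Crawl asks of crawlers.
request = urllib.request.Request(
    INDEX + query, headers={"User-Agent": "cc-index-example/0.1"}
)
with urllib.request.urlopen(request) as response:
    for line in response:
        record = json.loads(line)
        # Each record points at the WARC archive file holding the page.
        print(record["url"], "->", record["filename"])
```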

Second, she says the lawsuit makes very clear that GPT-4 is plagiarizing from The Times. It compares GPT-4 outputs with The Times’ content side by side, and it’s clear the content is lifted straight from the publication.
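
For a sense of what that side-by-side comparison involves, here’s a minimal sketch of one way to measure it (the eight-word window and the placeholder texts are our own assumptions, not anything taken from the complaint): count how many of the output’s word sequences appear verbatim in the article.

```python
def ngrams(text, n=8):
    """All n-word sequences in a text, normalized to lowercase."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def verbatim_overlap(article, model_output, n=8):
    """Fraction of the output's n-word sequences found word-for-word in the
    article. Long exact matches are hard to explain away as coincidence."""
    output_grams = ngrams(model_output, n)
    if not output_grams:
        return 0.0
    return len(output_grams & ngrams(article, n)) / len(output_grams)

article = "..."       # placeholder: the original Times article text
model_output = "..."  # placeholder: text produced by the model
print(f"{verbatim_overlap(article, model_output):.0%} verbatim overlap")
```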

What Does This Mean for AI and Media Companies?

This matters beyond the lawsuit itself. It has bigger implications across the industry.

This isn’t just about The Times’ content being used without permission. Common Crawl also includes content from other publications, including The Washington Post, Forbes, The Huffington Post, and others.

“If you start going down this list, you realize we’re just talking about the tip of the iceberg here,” says Roetzer. “Because if the New York Times has a case, then so does the Washington Post, Forbes, Huffington Post, all of them have the exact same potential issues. So that’s a really big problem.”

What’s Going to Happen Next?

Right now, it’s unclear how this gets resolved. But Roetzer sees a couple of possible paths forward.

One is that AI companies settle. They pay a few billion to resolve lawsuits without admitting wrongdoing. Then they rely on proprietary, licensed, or synthetic data to train all future models.

“They’ll just get around it by saying ‘we’re not going to train on stuff we’re stealing from people anymore,’” says Roetzer.

Another is that AI companies buy or build their own media companies to train future models. That way, they control the source data and reap the benefits of owning media narratives.

Some examples:

- Jeff Bezos owns The Washington Post.
- Salesforce’s Marc Benioff owns Time Magazine.
- Elon Musk bought Twitter, now X, in part for its data.

All of these have archives of proprietary content that could be used to train future AI models.

Even if it costs an arm and a leg to buy outlets outright, it may make more sense in the long run, says Roetzer.

“OpenAI and others can pay millions or billions in licensing fees and basically rent the data. Or they can just buy media outlets for less and scrap a dying advertising model that’s barely sustaining journalism as it is.”

If that happens, the outcome might be completely ironic, he says.

“Journalism is dying. You can’t fund local journalism through ad models. And so, in this great ironic twist, there’s a chance AI actually saves journalism rather than steals from it.”

Of course, it could also go in a negative direction, warns Roetzer. AI companies that own media outlets could then control what we see as truth and public record.

One thing is certain, though:

No matter which way this all goes...

The lawyers at OpenAI will be busy for the foreseeable future.