OpenAI CTO Mira Murati just gave a controversial interview to the Wall Street Journal about the company's new AI video model, Sora.
The interview started strong, with plenty of discussion of Sora's remarkable (but flawed) ability to generate video from text.
But things took a problematic turn when the reporter pressed Murati on where the training data for Sora came from. Murati stumbled, claiming she wasn't sure if YouTube, Instagram or Facebook videos were used. She vaguely cited "publicly available data" before shutting down further questions.
The segment lit up the AI world. If it's legal for AI companies to use copyrighted material for training, why the secrecy around data sources? The optics aren't great.
So was this just a PR blunder or a concerning peek behind the curtain of AI development?
I got the scoop from Marketing AI Institute founder/CEO Paul Roetzer on Episode 88 of The Artificial Intelligence Show.
This PR blunder is really a legal issue in disguise
On the surface, OpenAI CTO Mira Murati's failure to answer basic questions about Sora's training data in a Wall Street Journal interview seems like a rookie PR mistake.
As a former PR pro, Roetzer is baffled that a major tech company would let its C-suite face such a high-profile interview unprepared, especially on a question as obvious as data sourcing.
"We used to run workshops teaching people how to prep for media interviews," he says. "This would have been one of the questions I would have prepared this person for."
But the faux pas reveals a much thornier issue lurking beneath the surface:
Why the secrecy or uncertainty around AI training data?
Here's the crux of the problem, says Roetzer. If it's truly fair use for AI companies to train their models on copyrighted data, as they claim in ongoing lawsuits, then why not just say what data they used?
"If you think you're allowed to take videos from YouTube to train these models, then why don't you just say ‘We used YouTube videos’?" he asks.
The fact that OpenAI's own CTO can't, or won't, answer that question suggests a lack of confidence in the company's legal stance. There's simply no way Murati doesn't know what data Sora was trained on, says Roetzer.
The rest of the interview
That said, Murati did reveal some key details on Sora in the interview. She said the model takes a few minutes to generate each video and is still significantly more expensive to run than ChatGPT or DALL-E. (However, the goal is to get costs down to DALL-E levels by launch.)
The interview also contained demos that OpenAI generated for The Wall Street Journal. It's important to note that, while these demos were impressive, they had flaws.
For instance, in a video meant to recreate the "bull in a china shop" idiom, the bull walks through dishware without breaking anything. Other inconsistencies, like a yellow taxi cab morphing into a gray sedan, cropped up in the custom videos.
This aligns with a pattern we've seen before, says Roetzer: we only see the slickest demos, while real-world performance often lags far behind the highlight reels.
In the end, Roetzer feels the interview shed valuable light on Sora's development, even with the crucial question of training data left conspicuously unanswered.
But that unanswered question looms large. How AI models are trained, and on whose data, cuts to the heart of the ethical and legal dilemmas that will define this technology.