Big tech’s trapped in a glass house on AI data snatching

April 10, 2024 | 12:02 am

By Parmy Olson

A FEW WEEKS AGO, the chief technology officer of OpenAI was asked whether the company had used YouTube videos to train its AI systems. First, she gave a blank stare. Then there was a grimace. Finally, Mira Murati gave an answer that avoided the messy and furtive world she and other tech companies were operating in: “Actually, I’m not sure about that.”

According to a New York Times report, OpenAI in fact had trained its AI on “more than one million hours of YouTube videos,” using a speech recognition tool called Whisper. All the conversational text from the transcriptions was used to train GPT-4, the flagship large language model that underpins ChatGPT.

Large tech players racing to build more capable AI models have reached a point where they have fewer and fewer places to look for data on the public web, and taking text from the transcripts of YouTube videos suggests OpenAI has been digging between the proverbial couch cushions, even at the risk of breaking someone’s rules. There’s a decent chance it did. YouTube Chief Executive Officer Neal Mohan has said that if OpenAI had used YouTube videos to refine its AI, that would be a “clear violation” of YouTube’s terms of use. OpenAI didn’t respond to a request for comment.

Still, it’s hard to see the tension ratcheting up between OpenAI and Google over this. Google, for one, can hardly complain about a data violation when its entire business has been built on collecting the private data of billions of consumers, often at a . Google has also scraped transcription data from some YouTube videos to train its AI models, Mohan has said.

So ingrained is data harvesting in the business models of firms like Google and Meta Platforms, Inc. that the ethics of using people’s creative work without consent or compensation seems to have become an elephant in the room that simply isn’t discussed. When a lawyer at Meta recently pointed out the ethical concerns of scraping artists’ intellectual property, they were met with silence, according to the Times, which added that Meta executives considered buying a book publisher like Simon & Schuster to get access to more high-quality data but decided that securing licenses would take too long.

In the end, a Meta executive pointed out that, “The only thing that’s holding us back from being as good as ChatGPT is literally just data volume,” the Times reported. Since OpenAI appeared to be taking copyrighted material, Meta could simply follow this “market precedent,” he added.

Of course, Meta itself established the precedent well before OpenAI did, by harvesting vast amounts of personal data from consumers and sharing it with a Byzantine network of third parties. That’s why Mark Zuckerberg himself has described the mountain of Facebook and Instagram data he’s sitting on as an advantage in the AI race. “The next key part of our playbook is learning from unique data,” he told investors in February. “On Facebook and Instagram, there are hundreds of billions of publicly shared images and tens of billions of public videos.”

Meta and Google didn’t respond to requests for comment.

Has Google tried grabbing some of Meta’s data in the same way OpenAI scraped YouTube? Has Meta tried scraping any of Google’s user data to add to its AI training mountain? We may never know, but it’s plausible that the snatch-and-grab style of data gathering happening in the AI business right now goes beyond OpenAI and YouTube. Mining data is, after all, how these firms became multi-trillion-dollar businesses.

That’s also why it’s hard to see Google or Meta making much of a public fuss about their user data becoming a target for exploitation. That would not only be the ultimate example of throwing stones in glass houses, but it would also remind people of how much their personal lives — and now their creative work — are being turned into someone else’s product.

BLOOMBERG OPINION
