Big tech’s trapped in a glass house on AI data snatching

A FEW WEEKS AGO, the chief technology officer of OpenAI was asked whether the company had used YouTube videos to train its AI systems. First, she gave a blank stare. Then there was a grimace. Finally, Mira Murati gave an answer that avoided the messy and furtive world she and other tech companies were operating in: “Actually, I’m not sure about that.”
According to a New York Times report, OpenAI in fact had trained its AI on “more than one million hours of YouTube videos,” using a speech recognition tool called Whisper. All the conversational text from the transcriptions was used to train GPT-4, the flagship large language model that underpins ChatGPT.
Large tech players racing to build more capable AI models have reached a point where they have fewer and fewer places to look for data on the public web, and taking text from the transcripts of YouTube videos suggests OpenAI has been digging between the proverbial couch cushions, even at the risk of breaking someone’s rules. There’s a decent chance it did. YouTube Chief Executive Officer Neal Mohan has said that if OpenAI had used YouTube videos to refine its AI, that would be a “clear violation” of YouTube’s terms of use. OpenAI didn’t respond to a request for comment.
Still, it’s hard to see the tension ratcheting up between OpenAI and Google over this. Google, for one, can hardly complain about a data violation when its entire business has been built on collecting the private data of billions of consumers, often at a . Google has also scraped transcription data from some YouTube videos to train its AI models, Mohan said.
So ingrained is data harvesting in the business models of firms like Google and Meta Platforms, Inc. that the ethics of using people’s creative work without consent or compensation seems to have become an elephant in the room that simply isn’t discussed. When a lawyer at Meta recently pointed out the ethical concerns of scraping artists’ intellectual property, they were met with silence, according to the Times, which added that Meta executives considered buying a book publisher like Simon & Schuster to get access to more high-quality data but decided that securing licenses would take too long.
In the end, a Meta executive pointed out that, “The only thing that’s holding us back from being as good as ChatGPT is literally just data volume,” the Times reported. Since OpenAI appeared to be taking copyrighted material, Meta could simply follow this “market precedent,” he added.
Of course, Meta itself established the precedent well before OpenAI did, by harvesting vast amounts of personal data from consumers and sharing it with a Byzantine network of third parties. That’s why Mark Zuckerberg himself has touted the mountain of Facebook and Instagram data he’s sitting on as an advantage in the AI race. “The next key part of our playbook is learning from unique data,” he told investors in February. “On Facebook and Instagram, there are hundreds of billions of publicly shared images and tens of billions of public videos.”
Meta and Google didn’t respond to requests for comment.
Has Google tried grabbing some of Meta’s data in the same way OpenAI scraped YouTube? Has Meta tried scraping any of Google’s user data to add to its AI training mountain? We may never know, but it’s plausible that the snatch-and-grab style of data gathering happening in the AI business right now goes beyond OpenAI and YouTube. Mining data is, after all, how these firms became multi-trillion-dollar businesses.
That’s also why it’s hard to see Google or Meta making much of a public fuss about their user data becoming a target for exploitation. That would not only be the ultimate example of throwing stones in glass houses, but it would also remind people of how much their personal lives, and now their creative work, are being turned into someone else’s product.
BLOOMBERG OPINION


