As generative artificial intelligence booms, tech companies are looking for training data to improve their models — and some are taking it without permission.
Apple, Nvidia and Anthropic are among the tech companies found to have trained AI models on captions from tens of thousands of YouTube videos, despite the platform’s rules against downloading and using its content without permission, according to a Proof News investigation co-published with Wired.
The investigation found that the companies used a dataset called YouTube Subtitles that included transcripts of 173,536 YouTube videos from more than 48,000 channels. The videos in the dataset range from educational channels like Khan Academy and MIT, to news sites like The Wall Street Journal, to some of the platform’s top creators like MrBeast and Marques Brownlee.
“Apple has obtained data for its AI from multiple companies,” Brownlee wrote in a post on X. “One of them scraped tons of data/transcripts from YouTube videos, including mine.”
Brownlee added that while “Apple technically avoids ‘fault’ here because they’re not the ones recovering the data,” “this is going to be an evolving issue for a long time.”
Proof News also created a tool for creators to search for their content in the dataset, which included a handful of videos from Quartz. The YouTube captions dataset does not include images from videos, but does include some captions translated into languages such as German and Arabic.
The dataset was created by EleutherAI, “a nonprofit AI research lab” that says it aims to “promote open scientific standards,” and is part of a larger compilation of material from other sources, including the European Parliament and the English Wikipedia, called the Pile, according to Proof News.
“The Pile dataset mentioned in the research paper was formed in 2021 for academic and research purposes,” a spokesperson for Salesforce, one of the companies named in the investigation for using the dataset, said in a statement shared with Quartz. “The dataset was publicly available and released under a permissive license.”
Apple, Nvidia and Anthropic did not immediately respond to a request for comment.
In April, YouTube CEO Neal Mohan told Bloomberg that companies using YouTube videos, including transcripts or video clips, to train AI models such as OpenAI’s text-to-video generator, Sora, would be a “flagrant violation” of the platform’s policies. However, the New York Times reported days later that OpenAI had transcribed over a million hours of YouTube videos to train its GPT-4 model.