An AI training dataset used by tech giants was allegedly created by copying YouTube videos in violation of the platform's terms and conditions.
Non-profit artificial intelligence research group EleutherAI extracted subtitles from YouTube to create a dataset, which violates YouTube's terms of service, ProofNews reported on July 16.
The dataset, called Pile, supposedly includes subtitles for 173,536 YouTube videos from over 48,000 channels. Around 12,000 deleted videos are part of the dataset.
Several leading tech and AI firms, including Anthropic, have since used Pile for training. Anthropic spokesperson Jennifer Martinez said the dataset included "a very small subset of YouTube subtitles" but declined to comment on possible violations of YouTube's terms of service.
Business software company Salesforce also used this dataset. Salesforce vice president of artificial intelligence research Kaiming Xiong said the dataset was "publicly available" and that Salesforce was using it for academic and research purposes. ProofNews reported that Salesforce eventually published the same dataset.
Apple used Pile to train OpenELM, an efficient language model for on-device artificial intelligence. Nvidia, Bloomberg and Databricks have also used Pile to train artificial intelligence.
ProofNews reported that the list of companies that used this dataset is not exhaustive, as companies do not always disclose which datasets they use to train AI.