OpenAI: One million hours of YouTube videos to train GPT-4

In the world of artificial intelligence, the shortage of high-quality training data is a growing challenge. OpenAI reportedly took a controversial approach to overcome this obstacle: transcribing and using more than a million hours of YouTube videos to train its GPT-4 language model. Many experts are asking whether this violates the copyright of the people who wrote and shot those videos.

Desperate for training data, OpenAI reportedly developed its Whisper audio transcription model to transcribe a vast number of YouTube videos. According to The New York Times (via The Verge), OpenAI was aware that the practice was legally questionable but considered it “fair use.” The New York newspaper, which is suing the company for copyright infringement, reports that OpenAI president Greg Brockman was personally involved in collecting the videos used.

OpenAI spokesperson Lindsay Held said the company curates “unique” datasets for each model, using “numerous sources including publicly available data and non-public data partnerships.” But the Times article reveals that OpenAI had run out of useful data in 2021, prompting it to consider transcribing YouTube videos, podcasts, and audiobooks after working through other resources such as GitHub code and Quizlet content.

Google commented that it had “seen unconfirmed reports” of GPT-4 being trained on YouTube content. It added that, to train its own Gemini AI model, it collects transcripts from YouTube in accordance with its agreements with creators, while prohibiting “unauthorized scraping or downloading of content.”

Between AI and copyright

OpenAI's choice to transcribe and use YouTube videos to train GPT-4 raises legal and ethical questions. Although the company considers this to be “fair use,” the practice may violate both copyright and YouTube's usage policies.

As AI companies look for ways to address the shortage of training data (OpenAI and Google are joined by Meta and others), it remains to be seen how intellectual property rights and privacy will be managed. It is a topic that, we are sure, will be talked about again.



The New York Times

Walker Ronnie is a tech writer who keeps you informed on the latest developments in the world of technology. With a keen interest in all things tech-related, Walker shares insights and updates on new gadgets, innovative advancements, and digital trends. Stay connected with Walker to stay ahead in the ever-evolving world of technology.