Vall-E is an AI capable of imitating the human voice perfectly

Deepfakes have been under discussion for a long time now, along with their use in legitimate fields (such as the production of audiovisual content) and in less legitimate ones. For the uninitiated, a deepfake relies on an artificial intelligence (AI) capable of reproducing and simulating, in a faithful and credible way, the physical appearance and voice of real people.

It is a vast market, one that could revolutionize the world of entertainment and cinema. Think, for example, of the iconic character voices of the franchises we love so much. Then think about when, inevitably, those voice actors will no longer be with us. Deepfakes could bring the voices of deceased actors back to life for future sequels. And some have already given their consent: last September James Earl Jones, the iconic voice of Darth Vader, agreed to have his voice used in future Star Wars films through deepfake technology. The Star Wars franchise has already used AI several times to recreate younger versions of Mark Hamill and Carrie Fisher, as well as the late Peter Cushing.

The AI that imitates the voice perfectly: Vall-E

Today, specifically, we are talking about one of the latest developments in the deepfake world: Vall-E. It is an AI developed by a team of Microsoft researchers with one distinctive feature: it can perfectly imitate a human voice after listening to just 3 seconds of it, and the input audio files do not need to be of the highest quality.

The researchers who developed Vall-E say that most current text-to-speech systems are limited, and that these limits stem from their “dependence on high quality sources and clean data”.

In essence, to synthesize a voice credibly, an AI needs to analyze its sound spectrum in order to capture all of its possible dynamic and tonal variations. Poor-quality, compressed and noisy audio files normally produce unsatisfactory results. Vall-E, however, apparently makes do with very little: it is able to perfectly imitate a human voice even from a short, less-than-excellent audio clip.
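To make the idea of a "sound spectrum" a little more concrete, here is a minimal Python sketch. It has nothing to do with how Vall-E itself works: it simply computes the spectrum of a synthetic 100 Hz tone, then does the same after adding random noise, to show why noisy recordings are harder to analyze. The sample rate, tone frequency, noise level and function names are all assumptions chosen for illustration.

```python
import cmath
import math
import random


def magnitude_spectrum(samples):
    """Naive discrete Fourier transform: magnitudes for bins 0..N/2."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(samples))) / n
        for k in range(n // 2 + 1)
    ]


def dominant_bin(samples):
    """Index of the frequency bin with the largest magnitude."""
    spectrum = magnitude_spectrum(samples)
    return max(range(len(spectrum)), key=spectrum.__getitem__)


def noise_floor(samples, peak):
    """Average magnitude of every bin except the peak one."""
    spectrum = magnitude_spectrum(samples)
    others = [m for k, m in enumerate(spectrum) if k != peak]
    return sum(others) / len(others)


# Hypothetical values, chosen so the tone falls exactly on one bin.
SAMPLE_RATE = 800   # samples per second
N = 200             # number of samples (a quarter of a second)
TONE_HZ = 100       # frequency of the synthetic "voice"

clean = [math.sin(2 * math.pi * TONE_HZ * t / SAMPLE_RATE) for t in range(N)]

rng = random.Random(0)  # fixed seed so the run is repeatable
noisy = [x + rng.gauss(0, 0.5) for x in clean]

# Each bin covers SAMPLE_RATE / N hertz, so 100 Hz lands in bin 25.
peak = TONE_HZ * N // SAMPLE_RATE

if __name__ == "__main__":
    print("peak bin (clean):", dominant_bin(clean))
    print("peak bin (noisy):", dominant_bin(noisy))
    print("noise floor (clean):", noise_floor(clean, peak))
    print("noise floor (noisy):", noise_floor(noisy, peak))
```

In the clean signal practically all the energy sits in a single bin, while in the noisy one energy is smeared across every bin. A real voice is far more complex than a single tone, which is why noise in the source clip is usually such a problem for speech synthesis.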

Vall-E was trained on a very large and diverse dataset: over 60,000 hours of spoken English from more than 7,000 voices. The developers explain that the training data contains “noisy speech and inaccurate transcriptions”, especially when compared with the data used by other text-to-speech systems.

“The results of the experiment show that Vall-E significantly outperforms the state-of-the-art TTS [text-to-speech] system in terms of speech naturalness and speaker similarity,” reads an official document from the development team.

An example of the work of Vall-E

If you are curious to hear the voice (or rather the voices) of Vall-E, you can listen to some of its simulations at this link. The site hosts dozens of AI-spoken phrases, built from snippets of conversation, film clips and much more. The collection also shows off Vall-E's expressive skills: it can simulate emotions, reactions and intentions, such as anger and sadness, in its tone of voice, basically what a real-life actor would do. Finally, there are examples of voice synthesis under extreme conditions, starting from very poor recordings full of noise.

In short, for the moment it is only a first technical demo. But if this is the starting point, the potential is enormous.

The risks of AI

However, the world of AI and deepfakes is not all excitement and simulations. The sector, although constantly expanding, has to answer numerous criticisms. Do you remember the incredible boom of Lensa AI, DALL-E and other apps capable of generating art procedurally? As we explained in detail in this article, apps of this type are accused of violating privacy and intellectual property. In fact, they often draw freely on the works of other artists found online, without crediting or compensating them.

Where voice AIs are concerned, these worries are practically doubled, given that simulating the voices of private individuals could open the door to telemarketing scams and more. These risks are acknowledged by the developers of Vall-E themselves, who have declared:

“Because VALL-E is able to synthesize speech faithfully, simulating the identity of the speaker, it could lead to potential risks in the misuse of the model, such as spoofing voice identification or impersonating a specific speaker. To mitigate those risks, it is possible to create a detection model capable of verifying whether an audio clip comes from a VALL-E synthesizer.”

If only John Connor had had one in Terminator 2.

Published by
Walker Ronnie
