There has long been discussion about deepfakes and their use in fields both legitimate (such as the production of audiovisual content) and less so. For the uninitiated, a deepfake is the use of artificial intelligence (AI) to reproduce and simulate, in a remarkably faithful and credible way, the physical appearance and voice of real people.
It is a boundless market, one that could literally revolutionize the world of entertainment and cinema. Think, for example, of the iconic character voices of the franchises we love so much, and then think about when, inevitably, those voice actors will no longer be with us. Deepfakes could bring the voices of deceased actors back to life for eventual sequels. Some have already given their consent: last September James Earl Jones, the iconic voice of Darth Vader, agreed to have his voice recreated with deepfake technology in future Star Wars films. The Star Wars franchise has already used AI several times to recreate younger versions of Mark Hamill, Carrie Fisher and the late Peter Cushing.
The AI that imitates voices perfectly: VALL-E
Today, specifically, we are talking about one of the latest arrivals in the deepfake world: VALL-E. It is an AI, developed by a team of Microsoft researchers, with one standout capability: it can convincingly imitate a human voice from just 3 seconds of audio, and the incoming audio files do not need to be of the highest quality.
The researchers who developed VALL-E say most current text-to-speech systems are limited. These limitations stem from their "dependence on high quality sources and clean data".
In essence, in order to credibly synthesize a voice, an AI needs to analyze its sound spectrum so as to capture all of its possible dynamic and tonal variations. Poor quality, compressed, and noisy audio files normally produce unsatisfactory results. VALL-E, however, appears to make do with very little: it can closely imitate a human voice even from a short, less-than-pristine sound clip.
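To give a concrete sense of what "analyzing the sound spectrum" means, here is a minimal, purely illustrative sketch (not Microsoft's code; the function name and parameters are our own) that computes the magnitude spectrogram of a 3-second clip with a short-time Fourier transform, using only NumPy. Each column describes the frequency content of one ~25 ms slice of audio, which is exactly the kind of time-frequency picture a voice model reasons over:

```python
import numpy as np

def spectrogram(signal, sample_rate=16000, frame_len=400, hop=160):
    """Magnitude spectrogram via a short-time Fourier transform (STFT).

    Each column is the frequency content of one ~25 ms frame, so the
    matrix captures how tone and dynamics evolve over time.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft of a real frame of length 400 yields 201 frequency bins
    return np.abs(np.fft.rfft(frames, axis=1)).T  # shape: (freq_bins, n_frames)

# A synthetic 3-second "voice": a 220 Hz tone plus noise, sampled at 16 kHz.
sr = 16000
t = np.arange(3 * sr) / sr
clip = np.sin(2 * np.pi * 220 * t) + 0.1 * np.random.randn(len(t))

spec = spectrogram(clip, sr)
print(spec.shape)  # (201, 298)
```

The 220 Hz tone shows up as a bright horizontal band in the matrix, while the added noise spreads energy across all bins, which is why heavily degraded recordings make voice analysis harder.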
VALL-E was trained on a very large and diverse dataset: over 60,000 hours of spoken English from more than 7,000 speakers. The developers explain that the data fed to the artificial intelligence contains "noisy speech and inaccurate transcriptions", especially when compared to the data used by other text-to-speech systems.
"The results of the experiment show that VALL-E significantly outperforms the state-of-the-art TTS [text-to-speech] system in terms of speech naturalness and speaker similarity," reads an official document from the development team.
An example of VALL-E's work
If you are curious to hear the voice (or rather the voices) of VALL-E, you can listen to some of its simulations at this link. The site hosts dozens of AI-spoken phrases, crafted from conversational snippets, cinematic works, and much more. The collection also shows off VALL-E's expressive skills (it can simulate emotions, reactions and intentions such as anger and sadness in the tone of voice): essentially what a real-life actor would do. Finally, there are examples of vocal synthesis even in extreme conditions, from very dirty, noise-filled files.
In short, for the moment it is only a first technical demo. But if these are the premises, the potential is virtually unlimited.
The risks of AI
However, the world of AI and deepfakes is not all excitement and clever simulations. The sector, although in constant expansion, faces numerous criticisms. Do you remember the incredible boom of Lensa AI, DALL·E and other apps capable of generating art procedurally? As we have explained thoroughly in this article, apps of this type stand accused of violating privacy and intellectual property: they often draw freely on the works of other artists found online, without crediting or compensating them.
As far as voice AIs are concerned, these concerns practically double, given that simulating private individuals' voices could enable telemarketing scams and worse. These risks are acknowledged by the developers of VALL-E themselves, who have declared:
"Since VALL-E is able to synthesize speech that faithfully preserves the identity of the speaker, it could carry potential risks of misuse of the model, such as spoofing voice identification or impersonating a specific speaker. To mitigate those risks, it is possible to build a detection model capable of verifying whether an audio clip was synthesized by VALL-E."
If only John Connor had one in Terminator 2.