New AI Can Clone Voices With Only Seconds Of A Sample

New voice AI from Microsoft can clone a person's voice after listening to just three seconds of sample audio.

By Charlene Badasie | Published


New text-to-speech AI technology from Microsoft can clone a voice after hearing just three seconds of audio. Called VALL-E, the program was built by having the system listen to 60,000 hours of English audiobook narration from 7,000 speakers. The goal was to get the AI program to reproduce human-sounding speech. That training set is much larger than the ones similar programs have been built on.

The Microsoft team also developed a website that features several demos of VALL-E. The program can manipulate a cloned voice to say anything using AI prompts. It can also replicate a speaker's emotion or be configured to use different speaking styles, PC Mag reports. While voice cloning has been around for a while, Microsoft has made it much easier to replicate a voice.

Unfortunately, that same ease makes the technology a potential tool for cybercrime. Microsoft has also acknowledged that its voice-learning AI could be a security threat. “Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model,” researchers said in their paper.

This includes spoofing voice identification or impersonating a specific speaker. Microsoft says it might be possible to build a program that can detect whether a voice clip was synthesized by AI, but for now the company will refrain from making the code open source because of the risks, The Byte reports. Under the hood, VALL-E encodes speech as “discrete tokens,” which it then uses to reproduce a voice.


From a text prompt, the system generates acoustic tokens conditioned on the acoustics of the three-second recording, and those generated tokens are then used to synthesize the final waveform with the corresponding neural codec decoder. In addition to mimicking voice timbre and emotional tone, the AI can also copy the acoustic environment of the sample audio.
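As a rough illustration of that pipeline, the toy Python sketch below is not Microsoft's code: the encoder, decoder, and simple uniform quantizer are stand-ins invented for this example. It only shows the general idea of turning audio into a sequence of discrete integer tokens and decoding those tokens back into a waveform, which is the role a learned neural codec plays for VALL-E.

```python
# Toy illustration (not VALL-E): speech represented as discrete tokens,
# then decoded back into a waveform. A uniform quantizer stands in for
# the learned neural codec described in the paper.
import numpy as np

def encode_to_tokens(waveform: np.ndarray, n_levels: int = 256) -> np.ndarray:
    """Map samples in [-1, 1] to integer token IDs (stand-in codec encoder)."""
    clipped = np.clip(waveform, -1.0, 1.0)
    return np.round((clipped + 1.0) / 2.0 * (n_levels - 1)).astype(np.int64)

def decode_from_tokens(tokens: np.ndarray, n_levels: int = 256) -> np.ndarray:
    """Map token IDs back to an approximate waveform (stand-in codec decoder)."""
    return tokens.astype(np.float64) / (n_levels - 1) * 2.0 - 1.0

# A fake three-second "enrollment" clip at 16 kHz: a decaying sine wave.
sr = 16_000
t = np.arange(3 * sr) / sr
enrollment = 0.5 * np.sin(2 * np.pi * 220 * t) * np.exp(-t)

tokens = encode_to_tokens(enrollment)        # the discrete tokens the model conditions on
reconstruction = decode_from_tokens(tokens)  # the decoder turns tokens back into audio
print(tokens[:10], reconstruction[:3])
```

In the real system, a learned codec and the token generator in front of it replace the simple quantizer shown here; the point is only that the model works with sequences of discrete tokens rather than raw audio.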

For instance, if the voice sample came from a telephone call, the AI will simulate the sound and frequency properties of a phone call in its synthesized output. That’s just a fancy way of saying it will sound like a telephone call too, Ars Technica reports.

Samples from the Microsoft research team, in the paper’s Synthesis of Diversity section, also show that VALL-E can generate variations in tone by changing the random seed used during generation. But as impressive as the technology sounds, the voice-replicating AI program still has a few problems.
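Here is a small sketch of why the seed matters, under the assumption that the model samples its output tokens from a probability distribution rather than choosing them deterministically; the distribution and token IDs below are made up for illustration.

```python
# Toy sketch (assumptions, not VALL-E's code): sampling tokens from a fixed
# distribution with different seeds yields different sequences, which is
# how changing the random seed produces different renditions of the same text.
import numpy as np

def sample_tokens(probabilities: np.ndarray, length: int, seed: int) -> np.ndarray:
    """Sample a token sequence from a categorical distribution."""
    rng = np.random.default_rng(seed)
    return rng.choice(len(probabilities), size=length, p=probabilities)

probs = np.array([0.5, 0.3, 0.2])        # placeholder distribution over 3 token IDs
print(sample_tokens(probs, 8, seed=1))   # one rendition
print(sample_tokens(probs, 8, seed=2))   # a different rendition of the same prompt
```

Two seeds give two different token sequences for the same prompt, which the demos present as variations in tone for the same sentence.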

VALL-E sometimes struggles to pronounce words, and its output can sound artificially synthesized or robotic. Errors like these are common in machine-learning speech systems. According to the research team, even 60,000 hours of voice data is not enough to train the AI, especially when it comes to accents and dialects.

Additionally, the diversity of speaking styles and voices in the training data is not sufficient. LibriLight, the dataset the AI was trained on, is an audiobook collection that features one particular reading style. Still, the researchers say that creating a more accurate voice cloning program is achievable.

The voice AI system just needs to be trained on more, and more varied, audio. It remains to be seen when this groundbreaking program will be made available to the public.