Microsoft has introduced VALL-E, a new text-to-speech (TTS) language-modeling approach that uses discrete audio codec codes as intermediate representations and can reproduce a person’s voice after listening to just three seconds of recorded audio, InfoQ reports.
According to the research paper, VALL-E can create high-quality personalized speech from just a three-second recording of an unseen speaker, which serves as an acoustic prompt. It does this without additional structural engineering, pre-designed acoustic features, or fine-tuning, and it supports in-context learning and prompt-based zero-shot TTS.
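According to the paper, VALL-E operates on the discrete codes produced by a neural audio codec (EnCodec) rather than on mel spectrograms: a language model predicts codec tokens from the input text plus the codes of the acoustic prompt, and a codec decoder turns the predicted tokens back into a waveform. As an illustration only, the sketch below shows how a three-second speaker prompt could be turned into such discrete codes using the open-source encodec package; the file name prompt_3s.wav and the bandwidth setting are assumptions for the example, not details from the paper.

```python
# Minimal sketch: encode a 3-second speaker prompt into discrete codec tokens
# using Meta's open-source EnCodec model (the codec VALL-E builds on).
# NOTE: "prompt_3s.wav" and the 6.0 kbps bandwidth are illustrative assumptions.
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Load the pretrained 24 kHz EnCodec model; the target bandwidth determines
# how many residual codebooks (quantizers) are used per frame.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

# Load the speaker prompt and resample/remix it to the codec's expected format.
wav, sr = torchaudio.load("prompt_3s.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

# Encode the waveform into discrete codes of shape [batch, n_codebooks, n_frames].
with torch.no_grad():
    encoded_frames = model.encode(wav)
codes = torch.cat([frame[0] for frame in encoded_frames], dim=-1)
print(codes.shape)
```

Working with discrete tokens of this kind is what lets the model be trained like an ordinary text language model, which is the basis of its in-context, zero-shot behavior.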
Microsoft provides audio demonstrations of the AI model in action. The “Speaker Prompt” is the three-second sample whose voice VALL-E must duplicate. For comparison, the “Ground Truth” is a pre-recorded excerpt from the same speaker saying the target phrase (serving as the “control” in the experiment). The “Baseline” sample is the output of a conventional text-to-speech system, and the “VALL-E” sample is the output of the VALL-E model.
TTS technology has been integrated into a wide range of applications and devices, such as virtual assistants like Amazon’s Alexa and Google Assistant, navigation applications, and e-learning platforms. It is also used in industries such as entertainment, advertising and customer service to create more engaging and personalized experiences.