Whether it’s giving directions, answering questions, or carrying out requests, speech recognition makes life easier in countless ways. But today the technology is available for only a small fraction of the thousands of languages spoken around the globe. This is because high-quality systems need to be trained with large amounts of transcribed speech audio. This data simply isn’t available for every language, dialect, and speaking style. Transcribed recordings of English-language novels, for example, will do little to help machines learn to understand a Basque speaker ordering food off a menu or a Tagalog speaker giving a business presentation.
Wav2vec Unsupervised (wav2vec-U) was developed to build speech recognition systems that require no transcribed data at all. It rivals the performance of the best supervised models from only a few years ago, which were trained on nearly 1,000 hours of transcribed speech.
How it works:
The method begins by learning the structure of speech from unlabeled audio. Using wav2vec 2.0 representations and a simple k-means clustering method, it segments the recording into speech units that loosely correspond to individual sounds. (The word cat, for example, includes three sounds: “/K/”, “/AE/”, and “/T/”.)
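To make this step concrete, here is a minimal sketch that clusters frame-level wav2vec 2.0 representations with k-means and merges consecutive frames that fall in the same cluster into segments. It is an illustration under stated assumptions, not the exact wav2vec-U pipeline: the torchaudio checkpoint, the placeholder file name utterance.wav, and the choice of 50 clusters are all assumptions for the example.

```python
import torch
import torchaudio
from sklearn.cluster import KMeans

# Load a pretrained wav2vec 2.0 model (torchaudio's bundled checkpoint).
bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

# Load a recording, downmix to mono, and resample to the model's rate.
# "utterance.wav" is a placeholder file name.
waveform, sr = torchaudio.load("utterance.wav")
waveform = waveform.mean(0, keepdim=True)
if sr != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

# Extract frame-level self-supervised representations.
with torch.no_grad():
    features, _ = model.extract_features(waveform)
frames = features[-1].squeeze(0).numpy()  # (num_frames, dim)

# Cluster the frames; each cluster id acts as a crude "speech unit".
unit_ids = KMeans(n_clusters=50, n_init=10, random_state=0).fit_predict(frames)

# Merge runs of identical cluster ids into segments that loosely
# correspond to individual sounds.
segments, start = [], 0
for i in range(1, len(unit_ids)):
    if unit_ids[i] != unit_ids[i - 1]:
        segments.append((start, i, int(unit_ids[i - 1])))
        start = i
segments.append((start, len(unit_ids), int(unit_ids[-1])))
print(segments[:10])  # (start_frame, end_frame, unit_id)
```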
A generative adversarial network (GAN) consisting of a generator and a discriminator takes each audio segment, embedded in self-supervised representations, and predicts a phoneme corresponding to a sound in the language. The generator is trained by trying to fool the discriminator, which assesses whether the predicted phoneme sequences look realistic. Initially, the transcriptions are very poor, but over time, with the discriminator's feedback, they become increasingly accurate.
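The following PyTorch sketch shows one adversarial training step under simplified assumptions. The layer shapes, architectures, and learning rates are illustrative, not the paper's configuration; what it preserves is the key idea that the discriminator's "real" examples are phoneme sequences from phonemized unlabeled text, not transcriptions of the audio.

```python
import torch
import torch.nn as nn

NUM_PHONEMES, FEAT_DIM = 40, 512  # illustrative sizes, not the paper's

# Generator: maps each segment representation to a phoneme distribution.
generator = nn.Sequential(nn.Linear(FEAT_DIM, NUM_PHONEMES), nn.Softmax(dim=-1))

# Discriminator: scores whether a phoneme sequence looks realistic.
discriminator = nn.Sequential(
    nn.Conv1d(NUM_PHONEMES, 128, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
    nn.Linear(128, 1),
)

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(segment_feats, real_phonemes):
    # segment_feats: (batch, seq, FEAT_DIM) speech segment representations.
    # real_phonemes: (batch, seq, NUM_PHONEMES) one-hot sequences taken
    # from phonemized unlabeled text, not transcriptions of the audio.
    fake = generator(segment_feats).transpose(1, 2)  # (batch, phonemes, seq)
    real = real_phonemes.transpose(1, 2)

    # Discriminator: tell real text phonemes apart from predictions.
    d_loss = (bce(discriminator(real), torch.ones(real.size(0), 1))
              + bce(discriminator(fake.detach()), torch.zeros(fake.size(0), 1)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator: try to fool the discriminator.
    g_loss = bce(discriminator(fake), torch.ones(fake.size(0), 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()

# Smoke test on random data.
feats = torch.randn(8, 20, FEAT_DIM)
onehot = torch.eye(NUM_PHONEMES)[torch.randint(NUM_PHONEMES, (8, 20))]
print(train_step(feats, onehot))
```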
Benchmarks such as TIMIT and LibriSpeech measure performance on English speech, for which good speech recognition technology already exists thanks to large, widely available labeled data sets. However, unsupervised speech recognition is most impactful for languages with little to no labeled data. Facebook AI therefore tried the method on other languages and, according to the team, the technology is particularly interesting for languages with few data resources, such as Swahili, Tatar, and Kyrgyz.
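Recognition quality on such benchmarks is typically reported as an error rate: the edit distance between the system's output and a reference, divided by the reference length (phoneme error rate on TIMIT, word error rate on LibriSpeech). A small self-contained sketch, reusing the "cat" example from above with a hypothetical system output:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution
    return dp[-1]

# Reference phonemes for "cat" versus a hypothetical system output.
ref = "K AE T".split()
hyp = "K AE P T".split()
print(f"PER: {edit_distance(ref, hyp) / len(ref):.1%}")  # 33.3%: one insertion
```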
AI technologies like speech recognition should not benefit only people who are fluent in one of the world’s most widely spoken languages. More generally, people learn many speech-related skills just by listening to others around them. This suggests that there is a better way to train speech recognition models, one that does not require large amounts of labeled data.