Facebook AI Open-Sources Multilingual LibriSpeech

Facebook AI is releasing Multilingual LibriSpeech (MLS), a large-scale, open-source data set designed to help advance research in automatic speech recognition (ASR). MLS is intended to support the speech research community’s work in languages beyond English, so that people around the world can benefit from improvements in a wide range of AI-powered services. The release was announced on the Facebook AI blog.

MLS provides more than 50,000 hours of audio across eight languages: English, German, Dutch, French, Spanish, Italian, Portuguese, and Polish. It also provides language-model training data and pre-trained language models along with baselines to help researchers compare different ASR systems.

The Facebook AI team trained baseline acoustic models and decoded them using a 5-gram language model for each of the languages. When the model trained on MLS’s English subset was evaluated on the standard noisy test set of LibriSpeech, the team reported a 20% improvement in word error rate compared with the same model trained on LibriSpeech data.
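For readers unfamiliar with the metric, word error rate (WER) is conventionally the word-level edit distance between a reference transcript and the recognizer's output, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition, not the evaluation code used by the Facebook AI team. The function names are hypothetical, and `relative_improvement` assumes the 20% figure is a relative reduction, which the announcement as summarized here does not specify.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fractional reduction in WER relative to the baseline (assumed reading of '20%')."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer
```

As a usage example, `word_error_rate("the cat sat on the mat", "the cat sat on mat")` counts one deleted word out of six reference words, giving a WER of 1/6. With made-up numbers, a drop from a WER of 10.0 to 8.0 would correspond to a relative improvement of 0.2 under this reading.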

MLS is a read-speech data set that leverages LibriVox audiobook data. It builds on the widely used LibriSpeech ASR benchmark, making it larger scale and extending it from English-only to the seven other languages noted above.

Open data sets and benchmarks have been key drivers of recent advances across AI. MLS provides a valuable resource for research in large-scale training of ASR systems. Its English-language data set is about 47x larger than the training data in LibriSpeech. While data sets and benchmarks exist for non-English languages, they are often relatively small, scattered across different sources, and rarely available under an open, permissive license.