MLS provides more than 50,000 hours of audio across eight languages: English, German, Dutch, French, Spanish, Italian, Portuguese, and Polish. It also provides language-model training data and pre-trained language models along with baselines to help researchers compare different ASR systems.
The Facebook AI team has trained baseline acoustic models and decoded them with a 5-gram language model for each of the languages. Evaluated on the standard noisy test set of LibriSpeech, the model trained on MLS’s English subset achieved a 20% improvement in word error rate compared with the same model trained on LibriSpeech data.
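For readers unfamiliar with the metric, here is a minimal sketch of how word error rate (WER) and an improvement between two systems might be computed. It assumes the improvement is measured as a relative reduction in WER, which the text above does not specify. The function names and the sample numbers are hypothetical illustrations, not the team's evaluation code or results.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction (assumed definition of 'improvement')."""
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Illustrative values only, not figures from the MLS evaluation.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    print(relative_improvement(baseline_wer=0.10, new_wer=0.08))
```

Under this assumed definition, a system whose WER falls from 10% to 8% shows a 20% relative improvement, even though the absolute change is two percentage points.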
MLS is a read-speech dataset that leverages LibriVox audiobook data. It builds on the widely used LibriSpeech ASR benchmark, scaling it up and extending it from English-only to the seven other languages noted above.
Open datasets and benchmarks have been key drivers of recent advances across AI. MLS provides a valuable resource for research in large-scale training of ASR systems. Its English-language dataset is about 47x larger than the LibriSpeech training data. While datasets and benchmarks exist for non-English languages, they are often relatively small, scattered across different sources, and rarely available under an open, permissive license.