The first high-performance self-supervised algorithm that works for speech, vision, and text

Meta AI has open-sourced data2vec, a unified framework for self-supervised deep learning on images, text, and speech audio data.

When evaluated on common benchmarks, models trained using data2vec perform as well as or better than state-of-the-art models trained with modality-specific objectives. The algorithm and experiments were described in a paper published on arXiv.

data2vec unifies self-supervised learning by having models learn to predict representations of input data, that is, the values in the hidden layers of a neural network. This abstraction away from the raw input allows the same training algorithm to be used for many different data types, as InfoQ explains in a recent article.

To demonstrate data2vec’s power, the Meta researchers trained models for computer vision (CV), natural language processing (NLP), and speech recognition (SR). Their models outperformed previous self-supervised models on CV and SR tasks, and were “competitive” on NLP. The Meta team further explained:

“In addition to helping accelerate progress in AI, data2vec brings us closer to building machines that learn seamlessly about different aspects of the world around them. It will enable us to develop more adaptable AI, which we believe will be able to perform tasks beyond what today’s systems can do.”

Supervised machine learning often requires training on large hand-labeled datasets to perform well. As a result, many researchers have turned to transfer learning, where a model is pre-trained via self-supervised learning on a large unlabeled dataset, then fine-tuned for a specific task.

Numerous pre-trained NLP models, such as BERT, use a masked language model objective for self-supervised training, where the model is trained to predict words or tokens that are masked from an input sequence. Similar objectives have been applied to other domains, but models for different data types are often pre-trained with different training objectives. For instance, CV models often use a contrastive loss, learning to map similar images to neighborhoods in a latent space.
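As a rough illustration of the masked-prediction idea described above, the following Python sketch hides a random subset of tokens and records which original tokens a model would be asked to recover. It is a simplified toy, not BERT's actual preprocessing: the `mask_tokens` helper, the `[MASK]` placeholder string, and the masking rate are all assumptions chosen for illustration.

```python
import random

MASK = "[MASK]"  # placeholder token (assumed name, for illustration only)


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, targets) for a masked-prediction objective.

    `targets` maps each masked position to the original token that a
    model would be trained to predict at that position.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)   # hide this token from the model
            targets[i] = tok      # remember what should be predicted
        else:
            masked.append(tok)    # leave the token visible
    return masked, targets


tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=1)
```

Here the prediction targets are the input tokens themselves, which is what ties this kind of objective to one modality: the targets for text are words, while images or audio would need a different notion of "token."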

For data2vec, the Meta team opted to use a masked learning objective, but instead of predicting masked tokens or units of input, the training objective is to predict “contextualized latent representations” based on the entire input. The model is based on a Transformer network and is used during training in either “teacher” or “student” mode. First, the teacher encodes the full input into a representation. The student is then given a masked version of the same input and must predict the teacher’s representations; the teacher’s weights are an exponential moving average of the student’s.
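The teacher and student setup can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not Meta's implementation: the two-weight `encode` function stands in for a Transformer, masking replaces values with zero, the loss is a plain mean squared error, and the hand-written weight change stands in for a gradient step. As described in the data2vec paper, the student sees a masked input and regresses onto the teacher's representations of the full input, and the teacher's weights track the student's through an exponential moving average; the paper's layer averaging and its choice of loss are omitted here.

```python
def encode(x, w):
    """Toy 'contextual' encoder: each output mixes its own input value
    with the mean of the whole sequence (a stand-in for a Transformer)."""
    mean = sum(x) / len(x)
    return [w[0] * xi + w[1] * mean for xi in x]


def mask_input(x, masked_positions, mask_value=0.0):
    """Replace the values at the masked positions with a constant."""
    return [mask_value if i in masked_positions else xi
            for i, xi in enumerate(x)]


def latent_prediction_loss(x, masked_positions, student_w, teacher_w):
    """Mean squared error between the student's outputs on the masked
    input and the teacher's outputs on the full input, computed only
    at the masked positions."""
    targets = encode(x, teacher_w)                              # teacher: full input
    preds = encode(mask_input(x, masked_positions), student_w)  # student: masked input
    errors = [(preds[i] - targets[i]) ** 2 for i in sorted(masked_positions)]
    return sum(errors) / len(errors)


def ema_update(teacher_w, student_w, tau=0.999):
    """Move the teacher weights toward the student weights
    (exponential moving average with decay tau)."""
    return [tau * t + (1 - tau) * s for t, s in zip(teacher_w, student_w)]


x = [1.0, 2.0, 3.0, 4.0]
student_w = [0.5, 0.5]
teacher_w = list(student_w)          # teacher starts as a copy of the student

loss = latent_prediction_loss(x, {1, 3}, student_w, teacher_w)
student_w = [0.6, 0.4]               # stand-in for a real gradient step
teacher_w = ema_update(teacher_w, student_w, tau=0.9)
```

Because the targets are hidden-layer values rather than words, pixels, or audio units, nothing in the loss depends on the input modality; only the encoder and the masking strategy would change between data types.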

To evaluate the performance of data2vec, the Meta researchers used the algorithm to pre-train several models. The team first implemented “modality-specific feature encoders and masking strategies” to feed into a generic Transformer. They then pre-trained three sets of models and evaluated them on the ImageNet (CV), Librispeech (SR), and GLUE (NLP) benchmarks. On ImageNet-1K, the data2vec models outperformed similar-sized ViT models, and on Librispeech, data2vec outperformed “the best prior work,” including HuBERT. On GLUE, the data2vec model performed “competitively” with a baseline RoBERTa model.

On Twitter, lead researcher Alexei Baevski answered several questions about the work. He noted that training the NLP model took “about 3.5 days” using 16 GPUs.

Check out our new work on a new SSL method called data2vec. This gets SOTA for same-size models on vision, speech and NLP with the same pre-training task (trained separately). Code and models for speech and text are out, and vision will be coming shortly! https://t.co/Z1ZKWq9f5P

— Alexei Baevski (@alexei_baevski) January 20, 2022

The data2vec code and pre-trained models for SR and NLP are available on GitHub. The CV model is not currently available, but is listed as “coming soon.”