Meta AI recently open-sourced data2vec, a unified framework for self-supervised deep learning on images, text, and speech audio data. When evaluated on common benchmarks, models trained using data2vec perform as well as or better than state-of-the-art models trained with modality-specific objectives, InfoQ noted.
Data2vec is a framework that uses the same learning method for speech, NLP, and computer vision. The core idea, as described in the arXiv paper, is to predict latent representations of the full input data based on a masked view of the input, in a self-distillation setup using a standard Transformer architecture.
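The self-distillation setup can be illustrated with a minimal NumPy sketch. This is not Meta's implementation: the single linear layer standing in for the Transformer, the masking ratio, and the EMA rate `tau` are illustrative assumptions. It only shows the shape of the idea: a teacher (an exponential moving average of the student) encodes the full input to produce latent targets, while the student sees a masked view and regresses toward those targets at the masked positions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a Transformer encoder: a single linear layer.
# In data2vec the same Transformer architecture handles all modalities.
W_student = rng.normal(size=(16, 16)) * 0.1
W_teacher = W_student.copy()  # teacher starts as a copy of the student

def encode(W, x):
    return x @ W  # (seq_len, dim) -> (seq_len, dim)

def data2vec_step(x, mask, tau=0.999):
    """One self-distillation step: the student sees a masked view,
    the teacher sees the full input and provides regression targets."""
    global W_teacher
    x_masked = x.copy()
    x_masked[mask] = 0.0                # blank out masked time steps
    targets = encode(W_teacher, x)      # latent targets from the full input
    preds = encode(W_student, x_masked)
    # Regress student predictions onto teacher targets at masked positions
    loss = np.mean((preds[mask] - targets[mask]) ** 2)
    # EMA update of the teacher (no gradient flows through the teacher)
    W_teacher = tau * W_teacher + (1 - tau) * W_student
    return loss

x = rng.normal(size=(10, 16))  # 10 "time steps" of 16-dim features
mask = rng.random(10) < 0.3    # mask roughly 30% of positions
mask[0] = True                 # ensure at least one masked position
loss = data2vec_step(x, mask)
print(f"step loss: {loss:.4f}")
```

In the real system the student is trained by gradient descent on this loss; the sketch omits the optimizer and only computes one forward pass and one teacher update.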
According to a post on Meta’s blog, data2vec simplifies learning across modalities by training models to predict their own representations of the input data, regardless of the modality. A single algorithm can therefore work with completely different types of input, removing the dependence on modality-specific targets in the learning task. Directly predicting representations is not straightforward, however: it requires defining a robust normalization of the features that is reliable across the different modalities.
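The normalization point can be sketched as well. The helper names below (`normalize`, `build_targets`) and the choice of averaging the top three layers are assumptions for illustration, not Meta's API: the idea is that targets built from several Transformer blocks are normalized per time step before averaging, so layers (and modalities) with very different feature scales contribute comparably.

```python
import numpy as np

def normalize(h, eps=1e-6):
    # Parameter-free normalization: zero mean, unit variance per time step.
    mu = h.mean(axis=-1, keepdims=True)
    sd = h.std(axis=-1, keepdims=True)
    return (h - mu) / (sd + eps)

def build_targets(layer_outputs, k=3):
    """Average the normalized outputs of the top-k blocks.
    Normalizing each layer first keeps targets on a comparable
    scale across layers with very different activation magnitudes."""
    top_k = layer_outputs[-k:]
    return np.mean([normalize(h) for h in top_k], axis=0)

rng = np.random.default_rng(1)
# Fake outputs of 6 blocks, (seq_len=5, dim=8), with wildly different scales
layers = [rng.normal(scale=10 ** i, size=(5, 8)) for i in range(6)]
targets = build_targets(layers, k=3)
print(targets.shape)  # targets keep the (seq_len, dim) shape
```

Without the per-layer normalization, the largest-scale layer would dominate the averaged target; with it, each time step of the target has near-zero mean regardless of the raw scales.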