Researchers at Google have unveiled MusicLM, an artificial intelligence model that generates high-fidelity music from textual descriptions such as “a soothing violin melody backed by a distorted guitar riff”. MusicLM casts conditional music generation as a hierarchical sequence-to-sequence modeling task and produces music at 24 kHz that remains consistent over several minutes.
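To make the phrase “hierarchical sequence-to-sequence modeling” more concrete, the sketch below mimics the data flow described in the paper: a text embedding conditions a coarse “semantic” token stage, which in turn conditions a fine-grained “acoustic” token stage that a neural codec would decode to 24 kHz audio. The functions here are illustrative placeholders, not Google’s models, and the token vocabularies and sizes are made up for the example.

```python
# Conceptual sketch of a hierarchical, two-stage text-to-music pipeline.
# All functions are toy placeholders that only illustrate the data flow.
import numpy as np

def embed_text(prompt: str) -> np.ndarray:
    """Placeholder for a joint text/music embedding (MuLan-style)."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal(128)

def generate_semantic_tokens(text_embedding: np.ndarray, n: int = 50) -> np.ndarray:
    """Stage 1 placeholder: coarse 'semantic' tokens capturing long-term
    structure, conditioned on the text embedding."""
    rng = np.random.default_rng(int(abs(text_embedding[0]) * 1e6) % (2**32))
    return rng.integers(0, 1024, size=n)

def generate_acoustic_tokens(semantic_tokens: np.ndarray, per_step: int = 4) -> np.ndarray:
    """Stage 2 placeholder: fine-grained acoustic codec tokens conditioned on
    the semantic tokens; a neural codec would decode these to 24 kHz audio."""
    rng = np.random.default_rng(int(semantic_tokens.sum()) % (2**32))
    return rng.integers(0, 4096, size=len(semantic_tokens) * per_step)

prompt = "a soothing violin melody backed by a distorted guitar riff"
semantic = generate_semantic_tokens(embed_text(prompt))
acoustic = generate_acoustic_tokens(semantic)
print(f"{len(semantic)} semantic tokens -> {len(acoustic)} acoustic tokens")
```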
Experiments show that MusicLM outperforms previous systems in both sound quality and adherence to the textual description. The researchers also demonstrate that MusicLM can be conditioned on both text and melody: it can transform hummed or played melodies to match the style described in a text caption. To support future research, they have publicly released MusicCaps, a dataset of 5,500 music-text pairs with rich textual descriptions written by human experts.
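As a quick way to explore the released data, the snippet below loads MusicCaps with the Hugging Face datasets library. It assumes the dataset is mirrored on the Hub under "google/MusicCaps"; the field names shown ("ytid", "caption") come from that mirror and may differ from what you find.

```python
# Hedged example: inspect the MusicCaps captions, assuming the dataset is
# available on the Hugging Face Hub as "google/MusicCaps".
from datasets import load_dataset

musiccaps = load_dataset("google/MusicCaps", split="train")
print(len(musiccaps))          # expected to be roughly 5.5k examples

example = musiccaps[0]
print(example["ytid"])         # YouTube clip the audio segment comes from (assumed field name)
print(example["caption"])      # free-text description written by a human expert (assumed field name)
```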
According to the research paper, MusicLM was trained on a dataset of 280,000 hours of music so that it could learn to generate coherent songs for complex descriptions, InfoQ reported.
MusicLM’s published samples include five-minute pieces generated from just one or two words, such as “melodic techno”, as well as 30-second clips that sound like complete songs, created from paragraph-long descriptions that prescribe genre, vibe, and even specific instruments.
Google, as with its previous attempts at this form of artificial intelligence, is being more cautious with MusicLM than some of its competitors may be with similar technology. The paper ends with the statement, “We have no plans to release models at this point.”