Music exists in many modalities: sheet music, audio recordings, live performances, and textual descriptions ranging from lyrics to musicological analyses. Searching for music across these formats and across languages, however, is often difficult. A new AI model called CLaMP 3 aims to address this. It is a universal framework for Music Information Retrieval (MIR) that connects multiple modalities and languages, enabling more comprehensive and precise search.
CLaMP 3 uses contrastive learning to align different music modalities (sheet music, audio signals, and performance recordings) with multilingual text in a shared representation space. Because each modality is aligned with text, retrieval also works between modalities that are not directly linked to each other, with text acting as a bridge. For example, a piece of music can be searched for by entering a textual description, even if no audio recording or sheet music is available. Conversely, an audio recording can be used to find similar pieces of music or related textual information.
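The idea of a shared representation space can be illustrated with a minimal sketch. It assumes embeddings have already been produced by some encoders (here they are hand-written toy vectors), and it uses a symmetric InfoNCE-style loss, a common form of contrastive objective; the source does not specify the exact loss, temperature, or encoder architecture used by CLaMP 3, so all function names and values below are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def contrastive_loss(music_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of (music, text) pairs.

    The temperature value is an illustrative assumption, not taken from the source.
    """
    logits = [[cosine(m, t) / temperature for t in text_embs] for m in music_embs]
    transposed = [list(col) for col in zip(*logits)]
    # Average the music-to-text and text-to-music directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))

def retrieve(query_emb, candidate_embs):
    """Rank candidate indices by cosine similarity to the query, best first."""
    scores = [cosine(query_emb, c) for c in candidate_embs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy 2-D embeddings: music[i] is meant to match text_aligned[i].
music = [[1.0, 0.1], [0.0, 1.0], [-1.0, 0.2]]
text_aligned = [[0.9, 0.0], [0.1, 1.0], [-1.0, 0.0]]
text_shuffled = [text_aligned[1], text_aligned[2], text_aligned[0]]

aligned_loss = contrastive_loss(music, text_aligned)
shuffled_loss = contrastive_loss(music, text_shuffled)
ranking = retrieve(text_aligned[1], music)  # text query, music candidates

print(f"loss with matching pairs:   {aligned_loss:.4f}")
print(f"loss with mismatched pairs: {shuffled_loss:.4f}")
print(f"music ranking for text query 1: {ranking}")
```

The loss is low when each music embedding lies closest to its own text and high when the pairs are mismatched, which is what pulls the modalities into one space during training. Once trained, retrieval in any direction reduces to a nearest-neighbor search such as `retrieve`.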
A special feature of CLaMP 3 is its ability to handle different languages and even generalize to languages not seen during training. The multilingual text encoder allows searching for music based on textual descriptions in various languages, which facilitates global music research and discovery. This cross-lingual generalization is a notable advance in the field of MIR and opens up new possibilities for intercultural musical exchange.
CLaMP 3 is trained on a comprehensive dataset called M4-RAG. This dataset comprises 2.31 million music-text pairs and was created using Retrieval-Augmented Generation (RAG). M4-RAG contains detailed metadata representing a broad spectrum of global music traditions. This diversity is crucial for the model's generalizability and for enabling search across different cultures and genres.
To advance research in MIR, the developers of CLaMP 3 have also released WikiMT-X, a new benchmark dataset. WikiMT-X consists of 1,000 triplets of sheet music, audio, and diverse textual descriptions. This dataset serves as a basis for evaluating and comparing different MIR models and allows for an objective assessment of their performance.
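As an illustration of how a benchmark of paired items can be used to compare retrieval models, the sketch below computes two common retrieval metrics, mean reciprocal rank (MRR) and recall@k. The source does not state which metrics are reported on WikiMT-X, so the choice of metrics, the function names, and the toy rankings are all assumptions made for this example.

```python
def reciprocal_rank(ranked_ids, relevant_id):
    """Return 1/position of the relevant item in a ranking, or 0.0 if absent."""
    for position, item in enumerate(ranked_ids, start=1):
        if item == relevant_id:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(rankings, relevant_ids):
    """Average reciprocal rank over all queries."""
    ranks = [reciprocal_rank(r, rel) for r, rel in zip(rankings, relevant_ids)]
    return sum(ranks) / len(ranks)

def recall_at_k(rankings, relevant_ids, k):
    """Fraction of queries whose relevant item appears in the top k results."""
    hits = sum(1 for r, rel in zip(rankings, relevant_ids) if rel in r[:k])
    return hits / len(rankings)

# Hypothetical model output: one ranked candidate list per query.
rankings = [
    ["a", "b", "c"],  # relevant item "a" ranked 1st
    ["b", "a", "c"],  # relevant item "a" ranked 2nd
    ["c", "b", "a"],  # relevant item "a" ranked 3rd
]
relevant_ids = ["a", "a", "a"]

mrr = mean_reciprocal_rank(rankings, relevant_ids)
r_at_1 = recall_at_k(rankings, relevant_ids, k=1)
r_at_2 = recall_at_k(rankings, relevant_ids, k=2)

print(f"MRR: {mrr:.4f}, recall@1: {r_at_1:.4f}, recall@2: {r_at_2:.4f}")
```

Because every model is scored on the same queries and the same ground-truth pairs, numbers like these can be compared directly across systems.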
In the reported experiments, CLaMP 3 achieves state-of-the-art results, significantly outperforming previous models on several MIR tasks. The results indicate that the model generalizes well in multimodal and multilingual music contexts. CLaMP 3 could change the way we search for, discover, and research music. Future research could focus on extending the model to further modalities, such as dance or visual representations of music. Integrating CLaMP 3 into existing music platforms and services could also improve the user experience.
Bibliography:
- Wu, S., et al. "CLaMP 3: Universal Music Information Retrieval Across Unaligned Modalities and Unseen Languages." arXiv preprint arXiv:2502.10362 (2025).
- https://arxiv.org/html/2502.10362v1
- https://synthical.com/article/CLaMP-3%3A-Universal-Music-Information-Retrieval-Across-Unaligned-Modalities-and-Unseen-Languages-44d5453c-275b-4e04-8aee-06c14ff67f92?
- https://x.com/gm8xx8/status/1891388762359234855
- http://paperreading.club/page?id=284460
- https://www.researchgate.net/publication/385010430_CLaMP_2_Multimodal_Music_Information_Retrieval_Across_101_Languages_Using_Large_Language_Models