Conference paper
Year: 2024
Abstract
Recent progress on audio-based music structure analysis has closely aligned with the appearance of new deep learning paradigms, notably for the extraction of robust spectro-temporal audio features and their sequential modeling. However, most recent methods resort to supervised learning, which requires careful annotation of audio music pieces. Such annotations may sometimes operate at different temporal scales from one dataset to another or comprise inconsistent variation markers across repetitions of identical segments. This work explores language models as an alternative to manual pre-processing of the section label space, thus facilitating training and predictions across different annotated corpora. We propose a joint audio-to-text embedding space in which latent representations of audio frames and their respective section labels are close. We take inspiration from recent works on cross-modal contrastive learning and demonstrate the plausibility of this paradigm in the context of music structure analysis.
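The joint audio-to-text embedding described above is trained so that audio-frame embeddings land close to the embeddings of their section labels. The abstract does not give the training objective, but cross-modal contrastive learning of this kind is commonly implemented with a symmetric InfoNCE loss (as in CLIP-style models). The sketch below is illustrative only: the function name, temperature value, and NumPy formulation are assumptions, not details from the paper.

```python
import numpy as np

def info_nce(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    audio_emb, text_emb: (N, d) arrays; row i of each matrix is a
    positive (audio frame, section label) pair, all other rows in the
    batch serve as negatives.
    """
    # L2-normalize so the dot product is cosine similarity
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature  # (N, N) similarity matrix

    def xent(l):
        # cross-entropy with the diagonal (matched pairs) as the target
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # average the audio-to-text and text-to-audio directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

At inference time, zero-shot labeling would then amount to embedding each audio frame and each candidate section label, and picking the label whose embedding is nearest in this shared space.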
Domains
Artificial Intelligence [cs.AI]
Origin: Files produced by the author(s)
Contributor: Morgan Buisson
https://hal.science/hal-04764247
Submitted on: Sunday, November 3, 2024, 20:08:29
Last modified on: Wednesday, November 13, 2024, 12:14:36
Dates and versions
License
- HAL Id: hal-04764247, version 1
Cite
Morgan Buisson, Christopher Ick, Tom Xi, Brian McFee. Zero-Shot Structure Labeling with Audio And Language Model Embeddings. Extended Abstracts for the Late-Breaking Demo Session of the 25th International Society for Music Information Retrieval Conference (ISMIR), Nov 2024, San Francisco, California, United States. ⟨hal-04764247⟩
Collections
Views: 104
Downloads: 31