Zero-Shot Structure Labeling with Audio And Language Model Embeddings - Equipe Signal, Statistique et Apprentissage

Conference paper, Year: 2024
Zero-Shot Structure Labeling with Audio And Language Model Embeddings
1 S2A - Signal, Statistique et Apprentissage (Télécom Paris 19 Place Marguerite Perey 91120 Palaiseau - France)
2 IDS - Département Images, Données, Signal (46, rue Barrault 75013 Paris; 15 Place Marguerite Perey 91120 Palaiseau (since Oct 2019) - France)
3 MARL - Music and Audio Research Lab [New York] (70 Washington Square South, New York, NY 10012 - United States)

Abstract

Recent progress on audio-based music structure analysis has closely aligned with the appearance of new deep learning paradigms, notably for the extraction of robust spectro-temporal audio features and their sequential modeling. However, most recent methods resort to supervised learning, which requires careful annotation of audio music pieces. Such annotations may sometimes operate at different temporal scales from one dataset to another or comprise inconsistent variation markers across repetitions of identical segments. This work explores language models as an alternative to manual pre-processing of the section label space, thus facilitating training and predictions across different annotated corpora. We propose a joint audio-to-text embedding space in which latent representations of audio frames and their respective section labels are close. We take inspiration from recent works on cross-modal contrastive learning and demonstrate the plausibility of this paradigm in the context of music structure analysis.
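The idea of a joint embedding space, in which audio frames sit close to the text of their section labels, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`zero_shot_labels`, `symmetric_contrastive_loss`), the temperature value, and the choice of a CLIP-style symmetric cross-entropy objective are assumptions made for illustration, and the embeddings are taken as given arrays rather than outputs of any particular audio or language model.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def zero_shot_labels(frame_emb, label_emb, label_names):
    """Assign to each audio frame the label whose text embedding is most similar.

    frame_emb: (n_frames, d) audio frame embeddings (assumed precomputed).
    label_emb: (n_labels, d) text embeddings of candidate section labels.
    label_names: list of n_labels strings, e.g. ["verse", "chorus"].
    """
    sims = l2_normalize(frame_emb) @ l2_normalize(label_emb).T
    return [label_names[i] for i in sims.argmax(axis=1)]


def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """CLIP-style symmetric cross-entropy over a batch of matched pairs.

    Assumes row i of audio_emb is paired with row i of text_emb, and that
    every other row in the batch acts as a negative. The temperature of
    0.07 is an illustrative default, not a value taken from the paper.
    """
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    logits = (a @ t.T) / temperature

    def cross_entropy_on_diagonal(scores):
        # Numerically stable log-softmax along each row.
        shifted = scores - scores.max(axis=1, keepdims=True)
        log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    audio_to_text = cross_entropy_on_diagonal(logits)
    text_to_audio = cross_entropy_on_diagonal(logits.T)
    return 0.5 * (audio_to_text + text_to_audio)
```

Because labels are compared through their text embeddings rather than through a fixed class index, a label string unseen during training can still be scored at inference time, which is what makes the labeling zero-shot. One caveat of this sketch is that frames sharing the same label within a batch would be treated as negatives of one another; a practical setup would need to account for that.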
Main file
ISMIR24_LBD_ZeroShotSegmentation.pdf (322.27 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-04764247, version 1 (03-11-2024)

Identifiers
  • HAL Id: hal-04764247, version 1

Cite

Morgan Buisson, Christopher Ick, Tom Xi, Brian McFee. Zero-Shot Structure Labeling with Audio And Language Model Embeddings. Extended Abstracts for the Late-Breaking Demo Session of the 25th International Society for Music Information Retrieval Conference (ISMIR), Nov 2024, San Francisco, California, United States. ⟨hal-04764247⟩
104 views
31 downloads
