Zero-Shot Structure Labeling with Audio And Language Model Embeddings

Conference paper, Year: 2024

Morgan Buisson (1, 2), Christopher Ick (3), Tom Xi (3), Brian McFee (3)

1 S2A - Signal, Statistique et Apprentissage (Télécom Paris, 19 Place Marguerite Perey, 91120 Palaiseau, France)
LTCI - Laboratoire Traitement et Communication de l'Information (Télécom Paris, 19 Place Marguerite Perey, 91120 Palaiseau, France)
- IMT - Institut Mines-Télécom [Paris] (37-39 Rue Dareau, 75014 Paris, France)
- Télécom Paris (19 Place Marguerite Perey, 91120 Palaiseau, France)
  - IMT - Institut Mines-Télécom [Paris] (37-39 Rue Dareau, 75014 Paris, France)
  - IP Paris - Institut Polytechnique de Paris (Route de Saclay, 91120 Palaiseau Cedex, France)

2 IDS - Département Images, Données, Signal (46 rue Barrault, 75013 Paris; 15 Place Marguerite Perey, 91120 Palaiseau since Oct 2019, France)
Télécom ParisTech (46 rue Barrault, 75634 Paris Cedex 13, France)

3 MARL - Music and Audio Research Lab [New York] (70 Washington Square South, New York, NY 10012, United States)
NYU - New York University [New York] (70 Washington Square South, New York, NY 10012, United States)
- NYU - NYU System (United States)

Morgan Buisson

Role: Corresponding author
PersonId: 1160385
IdHAL: morgan-buisson
ORCID: 0000-0001-8541-3071

Signal, Statistique et Apprentissage

Département Images, Données, Signal

Christopher Ick

Role: Corresponding author

Music and Audio Research Lab [New York]

Tom Xi

Role: Author

Music and Audio Research Lab [New York]

Brian McFee

Role: Author
PersonId: 1435470

Music and Audio Research Lab [New York]

Abstract

Recent progress on audio-based music structure analysis has closely aligned with the appearance of new deep learning paradigms, notably for the extraction of robust spectro-temporal audio features and their sequential modeling. However, most recent methods resort to supervised learning, which requires careful annotation of audio music pieces. Such annotations may sometimes operate at different temporal scales from one dataset to another or comprise inconsistent variation markers across repetitions of identical segments. This work explores language models as an alternative to manual pre-processing of the section label space, thus facilitating training and predictions across different annotated corpora. We propose a joint audio-to-text embedding space in which latent representations of audio frames and their respective section labels are close. We take inspiration from recent works on cross-modal contrastive learning and demonstrate the plausibility of this paradigm in the context of music structure analysis.
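The joint embedding idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the embeddings below are random stand-ins for the outputs of an audio encoder and a language model, the symmetric CLIP-style loss is one common choice of cross-modal contrastive objective, and the names `symmetric_contrastive_loss`, `zero_shot_labels` and the `temperature` value are assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    # Row i of audio_emb is assumed to be paired with row i of text_emb.
    logits = l2_normalize(audio_emb) @ l2_normalize(text_emb).T / temperature
    idx = np.arange(logits.shape[0])
    audio_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_audio = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (audio_to_text + text_to_audio)

def zero_shot_labels(frame_emb, label_emb, label_names):
    # Assign each audio frame the label whose text embedding is most similar.
    sims = l2_normalize(frame_emb) @ l2_normalize(label_emb).T
    return [label_names[i] for i in sims.argmax(axis=1)]

# Toy data: hypothetical label embeddings and frames that lie close to them.
rng = np.random.default_rng(0)
label_names = ["intro", "verse", "chorus"]
label_emb = rng.normal(size=(3, 16))
true_idx = np.array([0, 0, 1, 1, 2, 2, 1])
frame_emb = label_emb[true_idx] + 0.05 * rng.normal(size=(7, 16))

predicted = zero_shot_labels(frame_emb, label_emb, label_names)
aligned_loss = symmetric_contrastive_loss(frame_emb[[0, 2, 4]], label_emb)
shuffled_loss = symmetric_contrastive_loss(
    frame_emb[[0, 2, 4]], np.roll(label_emb, 1, axis=0)
)
```

Because labels are compared in an embedding space rather than as class indices, a label string that never appeared during training can in principle still be scored at inference time, which is what makes the setting zero-shot. In this toy setup the loss is lower for correctly paired embeddings than for shuffled pairs.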

Domains

Artificial Intelligence [cs.AI]

Main file

ISMIR24_LBD_ZeroShotSegmentation.pdf (322.27 KB)

Origin: Files produced by the author(s)

Contributor: Morgan Buisson

https://hal.science/hal-04764247

Submitted on: Sunday, 3 November 2024, 20:08:29

Last modified on: Wednesday, 13 November 2024, 12:14:36

Dates and versions

hal-04764247, version 1 (3 November 2024)

License

Attribution

Identifiers

HAL Id: hal-04764247, version 1

Cite

Morgan Buisson, Christopher Ick, Tom Xi, Brian McFee. Zero-Shot Structure Labeling with Audio And Language Model Embeddings. Extended Abstracts for the Late-Breaking Demo Session of the 25th International Society for Music Information Retrieval Conference (ISMIR), Nov 2024, San Francisco, California, United States. ⟨hal-04764247⟩

Collections

LTCI IDS S2A IP_PARIS INSTITUT-MINES-TELECOM

104 views

31 downloads
