Publications
Publications by category, in reverse chronological order. Generated by jekyll-scholar.
2026
- [International Conf.] TensLoRA: Tensor Alternatives for Low-Rank Adaptation. In 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2026
Low-Rank Adaptation (LoRA) is widely used to efficiently adapt Transformers by adding trainable low-rank matrices to attention projections. While effective, these matrices are considered independent for each attention projection (Query, Key, and Value) and each layer. Recent extensions have considered joint, tensor-based adaptations, but only in limited forms and without a systematic framework. We introduce TensLoRA, a unified framework that aggregates LoRA updates into higher-order tensors and models a broad family of tensor-based low-rank adaptations. Our formulation generalizes existing tensor-based methods and enables mode-specific compression rates, allowing parameter budgets to be tailored according to the modality and task. Experiments on vision and language benchmarks reveal that the tensor construction directly impacts performance, sometimes outperforming standard LoRA under similar parameter counts.
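The aggregation idea can be pictured with a small numpy sketch (all shapes, names, and the tensor layout here are illustrative assumptions, not the paper's implementation): independent per-layer, per-projection LoRA updates are stacked into one 4th-order tensor, which a tensor decomposition can then compress with mode-specific ranks.

```python
import numpy as np

# Hypothetical dimensions: L layers, P projections (Q, K, V), model dim d, LoRA rank r.
L, P, d, r = 12, 3, 64, 4

rng = np.random.default_rng(0)
# Standard LoRA: one independent (B, A) factor pair per layer and per projection.
A = rng.normal(size=(L, P, r, d))   # down-projections
B = rng.normal(size=(L, P, d, r))   # up-projections

# Each individual LoRA update is the matrix product B @ A.
updates = np.einsum("lpdr,lprk->lpdk", B, A)  # shape (L, P, d, d)

# Aggregating all updates into a single 4th-order tensor exposes joint
# structure across layers and projections, instead of treating each
# (layer, projection) pair independently.
print(updates.shape)  # (12, 3, 64, 64)
```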
@inproceedings{marmoret2026tenslora,
  title = {TensLoRA: Tensor Alternatives for Low-Rank Adaptation},
  author = {Marmoret, Axel and Bensaid, Reda and Lys, Jonathan and Gripon, Vincent and Leduc-Primeau, Fran{\c{c}}ois},
  booktitle = {2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
  year = {2026},
  month = may,
  organization = {IEEE},
  url = {https://arxiv.org/abs/2509.19391},
}
- [Preprint] Residual Connections and the Causal Shift: Uncovering a Structural Misalignment in Transformers. arXiv preprint arXiv:2602.14760. Submitted to EUSIPCO 2026, 2026
Large Language Models (LLMs) are trained with next-token prediction, implemented in autoregressive Transformers via causal masking for parallelism. This creates a subtle misalignment: residual connections tie activations to the current token, while supervision targets the next token, potentially propagating mismatched information if the current token is not the most informative for prediction. In this work, we empirically localize this input-output alignment shift in pretrained LLMs, using decoding trajectories over tied embedding spaces and similarity-based metrics. Our experiments reveal that the hidden token representations switch from input alignment to output alignment deep within the network. Motivated by this observation, we propose a lightweight residual-path mitigation based on residual attenuation, implemented either as a fixed-layer intervention or as a learnable gating mechanism. Experiments on multiple benchmarks show that these strategies alleviate the representation misalignment and yield improvements, providing an efficient and general architectural enhancement for autoregressive Transformers.
@article{lys2026residual,
  title = {Residual Connections and the Causal Shift: Uncovering a Structural Misalignment in Transformers},
  author = {Lys, Jonathan and Gripon, Vincent and Pasdeloup, Bastien and Marmoret, Axel and Mauch, Lukas and Cardinaux, Fabien and Hacene, Ghouthi Boukli},
  journal = {arXiv preprint arXiv:2602.14760},
  year = {2026},
}
- [Preprint] Inner Loop Inference for Pretrained Transformers: Unlocking Latent Capabilities Without Training. arXiv preprint arXiv:2602.14759. Submitted to EUSIPCO 2026, 2026
Deep Learning architectures, and in particular Transformers, are conventionally viewed as a composition of layers. These layers are often the sum of two contributions: a residual path that copies the input, and the output of a Transformer block. As a consequence, the inner representations (i.e., the inputs of these blocks) can be interpreted as iterative refinement of a propagated latent representation. Under this lens, many works suggest that the inner space is shared across layers, meaning that tokens can be decoded at early stages. Mechanistic interpretability goes even further by conjecturing that some layers act as refinement layers. Following this path, we propose inference-time inner looping, which prolongs refinement in pretrained off-the-shelf language models by repeatedly re-applying a selected block range. Across multiple benchmarks, inner looping yields modest but consistent accuracy improvements. Analyses of the resulting latent trajectories suggest more stable state evolution and continued semantic refinement. Overall, our results suggest that additional refinement can be obtained through simple test-time looping, extending computation in frozen pretrained models.
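The looping idea can be sketched on a toy residual block (the block, its weights, and the loop count below are hypothetical stand-ins, not the paper's models): at test time, the same frozen block is simply re-applied to prolong the refinement of the latent state.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=d)

# Toy residual "block": h -> h + f(h), standing in for a pretrained
# Transformer block with frozen weights.
W = 0.05 * rng.normal(size=(d, d))
def block(h):
    return h + np.tanh(W @ h)

# Standard inference applies the block once; inner looping re-applies the
# same block k times at inference, extending computation without training.
def inner_loop(h, k):
    for _ in range(k):
        h = block(h)
    return h

once = block(x)             # normal forward pass through the block
looped = inner_loop(x, 3)   # the same frozen block re-applied three times
```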
@article{lys2026inner,
  title = {Inner Loop Inference for Pretrained Transformers: Unlocking Latent Capabilities Without Training},
  author = {Lys, Jonathan and Gripon, Vincent and Pasdeloup, Bastien and Marmoret, Axel and Mauch, Lukas and Cardinaux, Fabien and Hacene, Ghouthi Boukli},
  journal = {arXiv preprint arXiv:2602.14759},
  year = {2026},
}
2025
- [National Conf.] AutoMashup: Automatic Music Mashups Creation. Marine Delabaere*, Léa Miqueu*, Michael Moreno*, Gautier Bigois*, and 7 more authors. In GRETSI'25 - XXXe Colloque Francophone de Traitement du Signal et des Images, 2025
We introduce AutoMashup, a system for automatic mashup creation based on source separation, music analysis, and compatibility estimation. We propose using COCOLA to assess compatibility between separated stems and investigate whether general-purpose pretrained audio models (CLAP and MERT) can support zero-shot estimation of track pair compatibility. Our results show that mashup compatibility is asymmetric — it depends on the role assigned to each track (vocals or accompaniment) — and that current embeddings fail to reproduce the perceptual coherence measured by COCOLA. These findings underline the limitations of general-purpose audio representations for compatibility estimation in mashup creation.
@inproceedings{delabaere2025automashup,
  title = {{AutoMashup: Automatic Music Mashups Creation}},
  author = {Delabaere, Marine and Miqueu, L{\'e}a and Moreno, Michael and Bigois, Gautier and Duong, Hoang and Fernandez, Ella and Manent, Flavie and Salgado-Herrera, Maria and Pasdeloup, Bastien and Farrugia, Nicolas and Marmoret, Axel},
  booktitle = {{GRETSI'25 - XXXe Colloque Francophone de Traitement du Signal et des Images}},
  year = {2025},
  url = {https://hal.science/hal-05191030},
}
- [National Conf.] Raining Words : Les modèles d'ASR peuvent-ils retranscrire les sous-genres de Metal ? Bastien Pasdeloup* and Axel Marmoret*. In GRETSI'25 - XXXe Colloque Francophone de Traitement du Signal et des Images, 2025
Extreme vocal styles in Metal music are known for their intensity, vocal saturation, and low intelligibility. In this paper, we evaluate the ability of recent Automatic Speech Recognition (ASR) models to transcribe such vocals in an out-of-distribution (OOD) setting. We assess five state-of-the-art ASR models using two types of data: isolated extreme vocal recordings and full metal songs from various subgenres. We also apply source separation techniques to extract vocals from the music tracks. Our results show that these models struggle to accurately transcribe extreme vocals, especially in cases of severe vocal distortion or atypical prosody.
@inproceedings{pasdeloup2025raining,
  title = {{Raining Words : Les modèles d'ASR peuvent-ils retranscrire les sous-genres de Metal ?}},
  author = {Pasdeloup, Bastien and Marmoret, Axel},
  booktitle = {{GRETSI'25 - XXXe Colloque Francophone de Traitement du Signal et des Images}},
  year = {2025},
  url = {https://hal.science/hal-05191118},
}
- [National Conf.] Apprentissage par transfert pour la détection et la classification automatiques de grandes baleines dans l'océan austral. In GRETSI'25 - XXXe Colloque Francophone de Traitement du Signal et des Images, 2025
Automatic detection and classification of cetacean vocalizations in passive acoustic recordings are complex tasks. While convolutional neural networks are widely used, their generalization is often constrained by scarce annotated data and high recording variability (geographic, temporal, equipment-related, etc.). Transfer learning, leveraging larger pretrained networks, offers a potential solution. In this context, we investigated the Perch encoder and evaluated it using metrics ensuring a fair comparison.
@inproceedings{jean2025apprentissage,
  title = {{Apprentissage par transfert pour la détection et la classification automatiques de grandes baleines dans l'océan austral}},
  author = {Jean-Labadye, Lucie and Dubus, Gabriel and Cazau, Dorian and Farrugia, Nicolas and Marmoret, Axel and Adam, Olivier},
  booktitle = {{GRETSI'25 - XXXe Colloque Francophone de Traitement du Signal et des Images}},
  year = {2025},
  url = {https://hal.science/hal-05201458},
}
2023
- [Journal] Barwise Music Structure Analysis with the Correlation Block-Matching Segmentation Algorithm. Axel Marmoret*†, Jérémy E Cohen‡, and Frédéric Bimbot†. Transactions of the International Society for Music Information Retrieval, Nov 2023
Music Structure Analysis (MSA) is a Music Information Retrieval task consisting of representing a song in a simplified, organized manner by breaking it down into sections typically corresponding to 'chorus', 'verse', 'solo', etc. In this work, we extend an MSA algorithm called the Correlation Block-Matching (CBM) algorithm, introduced by Marmoret et al. (2020, 2022b). The CBM algorithm is a dynamic programming algorithm that segments self-similarity matrices, which are a standard description used in MSA and in numerous other applications. In this work, self-similarity matrices are computed from the feature representation of an audio signal, and time is sampled at the bar scale. This study examines three different standard similarity functions for the computation of self-similarity matrices. Results show that, in optimal conditions, the proposed algorithm achieves a level of performance competitive with supervised state-of-the-art methods while only requiring knowledge of bar positions. In addition, the algorithm is made open-source and is highly customizable.
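As a rough illustration of the kind of self-similarity matrix the CBM algorithm segments, here is a minimal sketch assuming cosine similarity on synthetic barwise features (the similarity function and data are illustrative, not the paper's code):

```python
import numpy as np

def cosine_self_similarity(features):
    """Self-similarity matrix from barwise features.

    features: array of shape (n_bars, n_features), one row per bar.
    Returns the (n_bars, n_bars) matrix of pairwise cosine similarities.
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    normed = features / np.maximum(norms, 1e-12)
    return normed @ normed.T

# Two repeated sections (AABB) yield visible blocks along the diagonal,
# which a segmentation algorithm can then detect.
bars = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]])
S = cosine_self_similarity(bars)
```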
@article{marmoret2023barwise,
  title = {Barwise Music Structure Analysis with the Correlation Block-Matching Segmentation Algorithm},
  author = {Marmoret, Axel and Cohen, J{\'e}r{\'e}my E and Bimbot, Fr{\'e}d{\'e}ric},
  journal = {Transactions of the International Society for Music Information Retrieval},
  volume = {6},
  number = {1},
  pages = {167--185},
  doi = {10.5334/tismir.167},
  month = nov,
  year = {2023},
  url = {https://transactions.ismir.net/articles/10.5334/tismir.167},
}
- [International Conf.] Convolutive block-matching segmentation algorithm with application to music structure analysis. Axel Marmoret*†, Jérémy E Cohen‡, and Frédéric Bimbot†. In 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2023
Music Structure Analysis (MSA) consists of representing a song in sections (such as 'chorus', 'verse', 'solo', etc.), and can be seen as the retrieval of a simplified organization of the song. This work presents a new algorithm devoted to MSA, called the Convolutive Block-Matching (CBM) algorithm. In particular, the CBM algorithm is a dynamic programming algorithm applied to autosimilarity matrices, a standard tool in MSA. In this work, autosimilarity matrices are computed from the feature representation of an audio signal, and time is sampled at the bar scale. We study three different similarity functions for the computation of autosimilarity matrices. We report that the proposed algorithm achieves a level of performance competitive with that of supervised state-of-the-art methods on 3 of 4 metrics, while being fully unsupervised.
@inproceedings{marmoret2023convolutive,
  title = {Convolutive block-matching segmentation algorithm with application to music structure analysis},
  author = {Marmoret, Axel and Cohen, J{\'e}r{\'e}my E and Bimbot, Fr{\'e}d{\'e}ric},
  booktitle = {2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
  year = {2023},
  organization = {IEEE},
  doi = {10.1109/WASPAA58266.2023.10248174},
  url = {https://hal.science/hal-03834996},
}
2022
- [International Conf.] Semi-Supervised Convolutive NMF for Automatic Music Transcription. Haoran Wu, Axel Marmoret†, and Jérémy E Cohen‡. In Proceedings of the 19th Sound and Music Computing Conference, 2022
Automatic Music Transcription, which consists of transforming an audio recording of a musical performance into a symbolic format, remains a difficult Music Information Retrieval task. In this work, which focuses on piano transcription, we propose a semi-supervised approach using low-rank matrix factorization techniques, in particular Convolutive Nonnegative Matrix Factorization (CNMF). In the semi-supervised setting, only a single recording of each individual note is required. We show on the MAPS dataset that the proposed semi-supervised CNMF method performs better than state-of-the-art low-rank factorization techniques and slightly worse than state-of-the-art supervised deep learning methods, while suffering from generalization issues.
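The semi-supervised idea can be sketched with plain (non-convolutive) NMF: a dictionary of per-note spectral templates, built from the isolated-note recordings, stays fixed, and only the activations are updated. All data, dimensions, and the Euclidean loss below are illustrative assumptions; the convolutive extension is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed dictionary W: one spectral template per note, obtained beforehand
# from isolated-note recordings (the only supervision this setting needs).
n_bins, n_notes, n_frames = 32, 4, 50
W = np.abs(rng.normal(size=(n_bins, n_notes)))

# Synthetic mixture spectrogram V built from known ground-truth activations.
H_true = np.abs(rng.normal(size=(n_notes, n_frames)))
V = W @ H_true

# Multiplicative updates for H only (W stays fixed): the classic Euclidean
# NMF rule H <- H * (W^T V) / (W^T W H), which preserves nonnegativity.
H = np.abs(rng.normal(size=(n_notes, n_frames)))
for _ in range(500):
    H *= (W.T @ V) / (W.T @ W @ H + 1e-12)

# H now estimates which note is active in which frame (the transcription).
rel_err = np.linalg.norm(W @ H - V) / np.linalg.norm(V)
```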
@inproceedings{wu2022semi,
  title = {Semi-Supervised Convolutive NMF for Automatic Music Transcription},
  author = {Wu, Haoran and Marmoret, Axel and Cohen, J{\'e}r{\'e}my E},
  booktitle = {Proceedings of the 19th Sound and Music Computing Conference},
  year = {2022},
  doi = {10.5281/zenodo.6798192},
  url = {https://hal.science/hal-03608497},
}
- [International Conf.] Barwise Compression Schemes for Audio-Based Music Structure Analysis. Axel Marmoret*, Jérémy E Cohen†, and Frédéric Bimbot*. In Proceedings of the 19th Sound and Music Computing Conference, 2022
Music Structure Analysis (MSA) consists of segmenting a music piece into several distinct sections. We approach MSA within a compression framework, under the hypothesis that the structure is more easily revealed by a simplified representation of the original content of the song. More specifically, under the hypothesis that MSA is correlated with similarities occurring at the bar scale, this article introduces the use of linear and non-linear compression schemes on barwise audio signals. Compressed representations capture the most salient components of the different bars in the song and are then used to infer the song structure using a dynamic programming algorithm. This work explores both low-rank approximation models, such as Principal Component Analysis and Nonnegative Matrix Factorization, and “piece-specific” Auto-Encoding Neural Networks, with the objective of learning latent representations specific to a given song. Such approaches rely on neither supervision nor annotations, which are well known to be tedious to collect and possibly ambiguous in MSA description. In our experiments, several unsupervised compression schemes achieve a level of performance comparable to that of state-of-the-art supervised methods (for 3s tolerance) on the RWC-Pop dataset, showcasing the importance of barwise compression processing for MSA.
@inproceedings{marmoret2022barwise,
  title = {Barwise Compression Schemes for Audio-Based Music Structure Analysis},
  author = {Marmoret, Axel and Cohen, J{\'e}r{\'e}my E and Bimbot, Fr{\'e}d{\'e}ric},
  booktitle = {{Proceedings of the 19th Sound and Music Computing Conference}},
  year = {2022},
  doi = {10.5281/zenodo.6798330},
  url = {https://hal.science/hal-03600873},
}
- [National Conf.] Nonnegative tucker decomposition with beta-divergence for music structure analysis of audio signals. In XXVIIIème Colloque Francophone de Traitement du Signal et des Images (GRETSI 2022), 2022
Nonnegative Tucker Decomposition (NTD), a tensor decomposition model, has received increased interest in recent years because of its ability to blindly extract meaningful patterns in tensor data. Nevertheless, existing algorithms to compute NTD are mostly designed for the Euclidean loss. On the other hand, NTD has recently proven to be a powerful tool in Music Information Retrieval. This work proposes a Multiplicative Updates algorithm to compute NTD with the beta-divergence loss, often considered a better loss for audio processing. We notably show how to implement the multiplicative rules efficiently using tensor algebra, a naive approach being intractable. Finally, we show on a Music Structure Analysis task that unsupervised NTD fitted with the beta-divergence loss outperforms earlier results obtained with the Euclidean loss.
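For reference, the beta-divergence family mentioned above can be written in a few lines of numpy. The unified formula and its special cases (Euclidean, Kullback-Leibler, Itakura-Saito) are standard; this sketch is not the paper's implementation and assumes strictly positive inputs.

```python
import numpy as np

def beta_divergence(x, y, beta):
    """Beta-divergence d_beta(x | y), summed over all entries.

    beta = 2: half squared Euclidean distance; beta = 1: Kullback-Leibler;
    beta = 0: Itakura-Saito, a common choice for audio spectrograms.
    Assumes x and y have strictly positive entries.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    if beta == 1:
        return np.sum(x * np.log(x / y) - x + y)
    if beta == 0:
        return np.sum(x / y - np.log(x / y) - 1)
    return np.sum((x**beta + (beta - 1) * y**beta - beta * x * y**(beta - 1))
                  / (beta * (beta - 1)))

# For any beta, the divergence is nonnegative and zero iff x == y.
```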
@inproceedings{marmoret2022nonnegative,
  title = {Nonnegative tucker decomposition with beta-divergence for music structure analysis of audio signals},
  author = {Marmoret, Axel and Voorwinden, Florian and Leplat, Valentin and Cohen, J{\'e}r{\'e}my E and Bimbot, Fr{\'e}d{\'e}ric},
  booktitle = {XXVIII{\`e}me Colloque Francophone de Traitement du Signal et des Images (GRETSI 2022)},
  year = {2022},
  publisher = {GRETSI - Groupe de Recherche en Traitement du Signal et des Images},
  number = {001-0233},
  pages = {933--936},
  url = {https://hal.science/hal-03409508},
}
- [Ph.D. Thesis] Unsupervised machine learning paradigms for the representation of music similarity and structure. Axel Marmoret*. Université de Rennes. Segmentation code can be found here, low-rank factorization code can be found here, and the specific methods for compressing music barwise can be found here, Dec 2022
Musical structure, defined as a simplified representation of the organization of a song, is an important musicological concept, but hard to estimate automatically. This thesis presents new methods to automatically estimate the structural segmentation of a song, focusing the study of music at the bar scale. By developing a new segmentation algorithm (called “CBM”) and by comparing several unsupervised compression schemes (from linear and multilinear algebra to neural networks), the paradigms introduced in this thesis achieve segmentation performance that outperforms unsupervised state-of-the-art methods and is nearly on par with the global state of the art, obtained with supervised machine learning algorithms. In particular, as the methods described in this thesis are unsupervised, the estimates do not rely on annotated data, lowering the bias related to ambiguity and subjectivity (inherent to musical structure) while limiting the loss in performance compared to the best supervised methods. In addition, some of the methods studied in this thesis (in particular Nonnegative Tucker Decomposition) allow the automatic extraction of interpretable parts of a song, which may be used for tasks other than the estimation of structure, and contribute to the development of interpretable machine and deep learning algorithms, a major field of research nowadays.
@phdthesis{marmoret2022unsupervised,
  title = {Unsupervised machine learning paradigms for the representation of music similarity and structure},
  author = {Marmoret, Axel},
  year = {2022},
  month = dec,
  school = {Universit{\'e} de Rennes},
  url = {https://hal.science/tel-03937846},
}
2021
- [Preprint] Polytopic Analysis of Music. Axel Marmoret*, Jérémy E Cohen*, and Frédéric Bimbot*. arXiv preprint arXiv:2212.11054, 2021
Structural segmentation of music refers to the task of finding a symbolic representation of the organisation of a song, reducing the musical flow to a partition of non-overlapping segments. Under this definition, the musical structure may not be unique, and may even be ambiguous. One way to resolve that ambiguity is to see this task as a compression process, and to consider the musical structure as the optimization of a given compression criterion. In that viewpoint, C. Guichaoua developed a compression-driven model for retrieving the musical structure, based on the "System and Contrast" model and on polytopes, which are extensions of n-hypercubes. We present this model, which we call "polytopic analysis of music", along with a new dedicated open-source toolbox called MusicOnPolytopes (in Python). This model is also extended to the use of the Tonnetz as a relation system. Structural segmentation experiments are conducted on the RWC Pop dataset. Results show improvements over those previously presented by C. Guichaoua.
@article{marmoret2022polytopic,
  title = {Polytopic Analysis of Music},
  author = {Marmoret, Axel and Cohen, J{\'e}r{\'e}my E and Bimbot, Fr{\'e}d{\'e}ric},
  journal = {arXiv preprint arXiv:2212.11054},
  year = {2021},
}
2020
- [International Conf.] Uncovering audio patterns in music with nonnegative Tucker decomposition for structural segmentation. In ISMIR 2020 - 21st International Society for Music Information Retrieval, Oct 2020
Recent work has proposed the use of tensor decomposition to model repetitions and to separate tracks in loop-based electronic music. The present work investigates further the ability of Nonnegative Tucker Decomposition (NTD) to uncover musical patterns and structure in pop songs in their audio form. Exploiting the fact that NTD tends to express the content of bars as linear combinations of a few patterns, we illustrate the ability of the decomposition to capture and single out repeated motifs in the corresponding compressed space, which can be interpreted from a musical viewpoint. The resulting features also turn out to be efficient for structural segmentation, leading to experimental results on the RWC Pop data set which potentially challenge state-of-the-art approaches that rely on extensive example-based learning schemes.
@inproceedings{marmoret2020uncovering,
  title = {Uncovering audio patterns in music with nonnegative Tucker decomposition for structural segmentation},
  author = {Marmoret, Axel and Cohen, J{\'e}r{\'e}my E and Bertin, Nancy and Bimbot, Fr{\'e}d{\'e}ric},
  booktitle = {ISMIR 2020-21st International Society for Music Information Retrieval},
  year = {2020},
  month = oct,
  pages = {788--794},
  url = {https://hal.science/hal-02928733v1},
}
2019
- [Master's Thesis] Multi-Channel Automatic Music Transcription Using Tensor Algebra. Axel Marmoret*, Nancy Bertin*, and Jérémy E Cohen*. arXiv preprint arXiv:2107.11250, 2019
Music is an art perceived in unique ways by every listener, arising from acoustic signals. In the meantime, standards such as musical scores exist to describe it. Even if humans can perform this transcription, it is costly in terms of time and effort, even more so with the explosion of information following the rise of the Internet. In that sense, research is driven in the direction of Automatic Music Transcription. While this task is considered solved in the case of single notes, it remains open when notes are superposed, forming chords. This report aims at developing some of the existing techniques for Music Transcription, particularly matrix factorization, and at introducing the concept of multi-channel automatic music transcription. This concept is explored with mathematical objects called tensors.
@article{marmoret2019multi,
  title = {Multi-Channel Automatic Music Transcription Using Tensor Algebra},
  author = {Marmoret, Axel and Bertin, Nancy and Cohen, J{\'e}r{\'e}my E},
  journal = {arXiv preprint arXiv:2107.11250},
  year = {2019},
  url = {https://hal.science/hal-03301448},
}