Hello!
This Notebook is associated with the ICASSP 2022 submission, presenting audio outputs of the Nonnegative Tucker Decomposition (NTD) when optimizing different loss functions. In particular, the three evaluated loss functions are three special cases of the more general $\beta$-divergence: the Euclidean norm ($\beta = 2$), the Kullback-Leibler divergence ($\beta = 1$) and the Itakura-Saito divergence ($\beta = 0$).
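As a reminder (the notation here is ours, recalled for convenience), for scalars $x, y > 0$, the $\beta$-divergence is:

$$d_\beta(x|y) = \frac{1}{\beta(\beta - 1)}\left(x^\beta + (\beta - 1)\,y^\beta - \beta\,x\,y^{\beta - 1}\right), \qquad \beta \in \mathbb{R} \setminus \{0, 1\},$$

with the limit cases $d_1(x|y) = x \log\frac{x}{y} - x + y$ (Kullback-Leibler) and $d_0(x|y) = \frac{x}{y} - \log\frac{x}{y} - 1$ (Itakura-Saito); $\beta = 2$ yields half the squared Euclidean distance, $d_2(x|y) = \frac{1}{2}(x - y)^2$.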
More details about our algorithm can be found in the ICASSP submission (which is probably the reason you are on this page). Audio signals are obtained by applying the Griffin-Lim algorithm to the STFT spectrograms.
This notebook will present signals, showing results of:
- the Griffin-Lim reconstruction of the original barwise tensor-spectrogram, used as a reference for the reconstruction quality;
- the reconstructions obtained from the NTD with each of the three loss functions.
Note, though, that signals representing songs will be limited to their first 16 bars, in order to limit the size of this HTML page.
We insist on the fact that, while these audio signals are listenable, they are not of professional musical quality, either due to inaccuracies in the decomposition or due to the phase-estimation algorithm that we use (Griffin-Lim). Improving the reconstruction of these signals could constitute future work.
In the meantime, we believe that these audio examples illustrate well the potential and outputs of the NTD, and allow a qualitative evaluation of the differences between the loss functions.
Let's start by importing external libraries (which are installed automatically if you used `pip install`; otherwise, you should install them manually).
# External imports
# Module for manipulating arrays
import numpy as np
# Module for loading signals
import soundfile as sf
# Module for manipulating signals, notably computing the STFT and Griffin-Lim
import librosa
# Module for displaying audio players in the Notebook
import IPython.display as ipd
And now, let's import the `nn_fac` and `MusicNTD` code (respectively, code for Nonnegative Factorization methods, and code for everything else associated with NTD for music segmentation: data manipulation, segmentation, etc.):
# Module containing our NTD resolution algorithm
import nn_fac.ntd as NTD
# Module encapsulating the computation of features from the signal
import musicntd.model.features as features
# General module for manipulating data: conversion between time, bars, frame indexes, loading of data, ...
import musicntd.data_manipulation as dm
# Module constructing the tensor, starting from the spectrogram
import musicntd.tensor_factory as tf
# Plotting module
from musicntd.model.current_plot import *
Next, we need to load the song to decompose. We used "Come Together" by The Beatles as an example, but feel free to choose any song you'd like (in WAV format, though)!
NB: this comment only applies if you're running the Notebook yourself, and not reading the HTML, as the HTML is static.
# Song
song_path = "C:/Users/amarmore/this_folder/The Beatles - Come Together.wav"
the_signal, sampling_rate = sf.read(song_path)
# Get the downbeats
bars = dm.get_bars_from_audio(song_path)
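NB: `bars` contains the bar boundaries as (start, end) times in seconds, one tuple per bar; the rest of the Notebook indexes them from `bars[1]` onward (our reading of the code: the segment before the first downbeat is skipped).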
Let's compute the STFT of the song:
n_fft = 2048
hop_length = 32
# Compute the STFT of the first channel, then add the STFTs of the remaining channels
stft_complex = librosa.core.stft(np.asfortranarray(the_signal[:,0]), n_fft=n_fft, hop_length=hop_length)
for i in range(1, the_signal.shape[1]):
    stft_complex += librosa.core.stft(np.asfortranarray(the_signal[:,i]), n_fft=n_fft, hop_length=hop_length)
mag, phase = librosa.magphase(stft_complex, power=1) # Magnitude spectrogram
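NB: since the STFT is linear, summing the STFTs of the channels is equivalent to computing the STFT of the mono downmix (the sum of the channels).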
and then form the tensor-spectrogram of this STFT:
hop_length_seconds = hop_length / sampling_rate
subdivision = 96
tensor_stft = tf.tensorize_barwise(mag, bars, hop_length_seconds, subdivision)
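For intuition, here is a minimal, hypothetical sketch of what this barwise tensorization amounts to: for each bar, `subdivision` (here, 96) regularly spaced frames are selected in the spectrogram, and the resulting matrices are stacked into a third-order tensor. This sketch is an illustration only; refer to `musicntd.tensor_factory` for the actual implementation.

```python
def sketch_tensorize_barwise(spectrogram, bars, hop_length_seconds, subdivision):
    # Illustrative, hypothetical re-implementation: the real code is tf.tensorize_barwise.
    barwise_slices = []
    for (start, end) in bars[1:]:  # bar boundaries, in seconds
        # Indexes of `subdivision` regularly spaced frames within this bar
        frame_indexes = [int((start + k * (end - start) / subdivision) / hop_length_seconds)
                         for k in range(subdivision)]
        barwise_slices.append(spectrogram[:, frame_indexes])
    # Shape: (frequency, subdivision, number of bars)
    return np.stack(barwise_slices, axis=-1)
```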
We reconstruct the song from the unfolded tensor-spectrogram. Hence, the song will be reconstructed from the 96 chosen frames per bar.
To reconstruct the song, the algorithm needs the hop length of the STFT. As bars can be of different lengths, we compute the median hop length over the different bars, and apply it to all bars of the song.
hops = []
for bar_idx in range(tensor_stft.shape[2]):
    len_sig = bars[bar_idx+1][1] - bars[bar_idx+1][0]
    hop = int(len_sig/96 * sampling_rate)
    hops.append(hop)
median_hop = int(np.median(hops))
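For instance (hypothetical values, for illustration only), a bar lasting 2 seconds at a sampling rate of 44.1 kHz would yield hop = int(2/96 × 44100) = 918 samples.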
Now, let's recreate the signal from the barwise STFT, in order to assess the reconstruction quality of the Griffin-Lim algorithm alone. We limit the song to a certain number of bars (so as not to overload the final HTML file).
nb_bars = 16 # you can set it to 89 if you use the executable format, and listen to the whole song.
time = nb_bars * subdivision
audio_stft = librosa.griffinlim(np.reshape(tensor_stft[:,:,:nb_bars], (1025, time), order = 'F'), hop_length = median_hop)
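Note that this Fortran-order reshape is simply the unfolding of the tensor along its time dimension: it concatenates the bars one after the other along the time axis. The following equivalent formulation (a sanity check, not needed for the rest of the Notebook) may be easier to read:

```python
# Equivalent, more explicit unfolding: concatenate the barwise slices along the time axis
unfolded = np.concatenate([tensor_stft[:,:,bar] for bar in range(nb_bars)], axis=1)
assert np.allclose(unfolded, np.reshape(tensor_stft[:,:,:nb_bars], (1025, time), order='F'))
```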
Let's hear it:
ipd.Audio(audio_stft, rate=sampling_rate)
We already hear some artifacts coming from this reconstruction. Hence, the signals reconstructed from the decompositions won't sound better than this one, which is already degraded.
Let's compute the NTD of this tensor-spectrogram, and study the reconstructed signal and the barwise patterns obtained in the decomposition.
As a reminder, NTD is a tensor decomposition method, which can be used to retrieve patterns from data.
We refer to the ICASSP submission or to [1] for details.
First, we need to set the dimensions of the decomposition, corresponding to the dimensions of the core tensor. They were set empirically here.
ranks = [32,24,12] # Dimensions of the decomposition (frequency, intra-bar time, and pattern modes, respectively)
n_iter_max = 100
Below, the NTD is computed with the HALS algorithm, minimizing the Euclidean norm ($\beta$-divergence with $\beta = 2$) between the original and the reconstructed tensor.
core_beta2, factors_beta2 = NTD.ntd(tensor_stft, ranks = ranks, init = "tucker", verbose = False, deterministic = True,
                                    sparsity_coefficients = [None, None, None, None], normalize = [True, True, False, True], mode_core_norm = 2, n_iter_max = n_iter_max)
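To make the outputs concrete, here are the shapes one should expect for the returned core and factors, given the mode interpretations above and a 1025 × 96 × (number of bars) tensor-spectrogram (this is our reading of the decomposition, not an output excerpt):

```python
# Expected: [(1025, 32), (96, 24), (nb_of_bars, 12)]
# i.e., frequency templates, intra-bar temporal templates, and barwise pattern weights
print([factor.shape for factor in factors_beta2])
print(core_beta2.shape)  # Expected: (32, 24, 12), linking the three modes
```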
Below, the NTD is computed with the MU algorithm, minimizing the Kullback-Leibler divergence ($\beta$-divergence with $\beta = 1$) between the original and the reconstructed tensor.
core_beta1, factors_beta1 = NTD.ntd_mu(tensor_stft, ranks = ranks, init = "tucker", verbose = False, deterministic = True, beta = 1,
                                       sparsity_coefficients = [None, None, None, None], normalize = [True, True, False, True], mode_core_norm = 2, n_iter_max = n_iter_max)
Below, the NTD is computed with the MU algorithm, minimizing the Itakura-Saito divergence ($\beta$-divergence with $\beta = 0$) between the original and the reconstructed tensor.
core_beta0, factors_beta0 = NTD.ntd_mu(tensor_stft, ranks = ranks, init = "tucker", verbose = False, deterministic = True, beta = 0,
                                       sparsity_coefficients = [None, None, None, None], normalize = [True, True, False, True], mode_core_norm = 2, n_iter_max = n_iter_max)
Having decomposed the song with the three different loss functions, we will now compare the resulting decompositions by listening to them.
To that end, we recompose the barwise spectrograms from the factors of each NTD and use the Griffin-Lim algorithm to reconstruct a signal.
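Concretely (with notation of our own: $W$, $H$, $Q$ for `factors[0]`, `factors[1]`, `factors[2]`, and $G$ for the core), the spectrogram of the $b$-th bar is recomposed as:

$$\hat{X}_b = \sum_{p} Q_{b,p} \; W \, G_{:,:,p} \, H^T,$$

i.e., a weighted sum of the musical patterns $W G_{:,:,p} H^T$, which is exactly what the function below computes.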
# Function reconstructing the signal from the NTD results.
def reconstruct_song_from_ntd(core, factors, bars, nb_bars = None):
    if nb_bars is None:
        nb_bars = factors[2].shape[0]
    barwise_spec_shape = (factors[0]@core[:,:,0]@factors[1].T).shape
    signal_content = None
    hops = []
    for bar_idx in range(nb_bars):
        # Hop length of this bar (as above, the median over bars is used for the reconstruction)
        len_sig = bars[bar_idx+1][1] - bars[bar_idx+1][0]
        hops.append(int(len_sig/96 * sampling_rate))
        # Weighted sum of the patterns, forming the spectrogram of this bar
        patterns_weights = factors[2][bar_idx]
        bar_content = np.zeros(barwise_spec_shape)
        for pat_idx in range(ranks[2]):
            bar_content += patterns_weights[pat_idx] * factors[0]@core[:,:,pat_idx]@factors[1].T
        # Concatenate the bars along the time axis
        signal_content = np.concatenate((signal_content, bar_content), axis=1) if signal_content is not None else bar_content
    median_hop = int(np.median(hops))
    reconstructed_song = librosa.griffinlim(signal_content, hop_length = median_hop)
    return reconstructed_song
audio_beta2 = reconstruct_song_from_ntd(core_beta2, factors_beta2, bars, nb_bars = nb_bars)
signal_beta2 = ipd.Audio(audio_beta2, rate=sampling_rate)
audio_beta1 = reconstruct_song_from_ntd(core_beta1, factors_beta1, bars, nb_bars = nb_bars)
signal_beta1 = ipd.Audio(audio_beta1, rate=sampling_rate)
audio_beta0 = reconstruct_song_from_ntd(core_beta0, factors_beta0, bars, nb_bars = nb_bars)
signal_beta0 = ipd.Audio(audio_beta0, rate=sampling_rate)
plot_audio_diff_beta_in_dataframe(signal_beta2, signal_beta1, signal_beta0)
| beta = 2 | beta = 1 | beta = 0 |
| --- | --- | --- |
| (audio player) | (audio player) | (audio player) |