In [2]:

```
import pandas as pd
import numpy as np
import soundfile as sf
from musicntd.model.current_plot import *
import musicntd.data_manipulation as dm
import musicntd.autosimilarity_segmentation as as_seg
import musicntd.tensor_factory as tf
import musicntd.model.features as features
import musicntd.scripts.hide_code as hide
import nn_fac.ntd as NTD
```

This notebook aims to study different representations of music, and to find the most suitable one for our context (describing the signal for NTD).

Here, we will compare:

- STFT: Short-Time Fourier Transform,
- CQT: Constant-Q Transform,
- PCP: Pitch-Class Profiles, or Chromas.

This notebook is organised as follows:

- Firstly, we will plot these representations for a song (1.wav from RWC Pop),
- Secondly, we will see NTD results on these previously computed representations, and segment them while presenting two segmentation methods at the same time,
- Finally, we will decompose and segment the entire RWC Pop dataset with fixed parameters, so as to compare these representations quantitatively.

In [4]:

```
dataset_path = "C:\\Users\\amarmore\\Desktop\\Audio samples\\RWC Pop\\Entire RWC"
song_number = 1
song_path = dataset_path + "\\{}.wav".format(song_number)
# Choice of the annotation
annotations_type = "MIREX10"
annotations_folder = "C:\\Users\\amarmore\\Desktop\\Audio samples\\RWC Pop\\annotations\\{}\\".format(annotations_type)
annotation_path = annotations_folder + dm.get_annotation_name_from_song(song_number, annotations_type)
```

In [5]:

```
hop_length = 512
n_fft = hop_length * 4
the_signal, sampling_rate = sf.read(song_path)
hop_length_seconds = hop_length/sampling_rate
```

In [7]:

```
stft_spec = features.get_spectrogram(the_signal[:,0], sampling_rate,
                                     feature = "stft", n_fft = n_fft, hop_length = hop_length)
plot_me_this_spectrogram(stft_spec, invert_y_axis = False, x_axis = "Time (in number of frames)", y_axis = "Frequency (in indexes of bandwidths)")
```

In this example, low frequencies seem to dominate the representation. This is related to an acoustic property of the human ear: at the same sound intensity, high frequencies are perceived as louder than low ones. Hence, to be perceived equally loud, low frequencies usually need to be more intense than high frequencies.

In [5]:

```
cqt_spec = features.get_spectrogram(the_signal[:,0], sampling_rate,
                                    feature = "cqt", n_fft = n_fft, hop_length = hop_length)
plot_me_this_spectrogram(cqt_spec, invert_y_axis = False, x_axis = "Time (in number of frames)", y_axis = "Index of Constant-Q bandwidth")
```

The CQT seems to capture more mid-frequency information than the STFT, which, empirically, seems desirable.

Note that the 0 on the y-axis refers to the first Constant-Q bandwidth, and not to a "0 note". Bandwidths are aligned with MIDI notes, and the first bandwidth (0) is the 24th MIDI note (C1). Hence, bandwidth indices should be offset by 24 to be converted to the MIDI scale.
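As a quick illustration of this offset, here is a minimal sketch (the constant and helper names are ours, not part of `musicntd`) converting a Constant-Q bin index to a MIDI note number and name:

```python
# Hypothetical helpers illustrating the bin-to-MIDI offset described above.
CQT_MIDI_OFFSET = 24  # bin 0 of the CQT corresponds to MIDI note 24 (C1)
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def cqt_bin_to_midi(bin_index):
    """Convert a Constant-Q bin index to its MIDI note number."""
    return bin_index + CQT_MIDI_OFFSET

def midi_to_name(midi):
    """Convert a MIDI note number to a human-readable note name."""
    octave = midi // 12 - 1  # MIDI 24 -> C1, MIDI 60 -> C4
    return f"{NOTE_NAMES[midi % 12]}{octave}"

print(midi_to_name(cqt_bin_to_midi(0)))   # first CQT bin -> "C1"
print(midi_to_name(cqt_bin_to_midi(12)))  # one octave up -> "C2"
```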

In [7]:

```
pcp_spec = features.get_spectrogram(the_signal[:,0], sampling_rate,
                                    feature = "pcp", n_fft = n_fft, hop_length = hop_length)
plot_me_this_spectrogram(pcp_spec, invert_y_axis = False, x_axis = "Time (in number of frames)", y_axis = "Note index (in pitch-class, semi-tone spaced)")
```

PCP represent the harmonic content of the song, and discard the percussive content.

In this implementation, PCP are computed from the CQT of the signal.
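The principle can be sketched as follows (a simplified sketch of the idea, not the actual `musicntd` implementation): fold the CQT bins into 12 pitch classes by summing octave-equivalent bins, assuming 12 bins per octave:

```python
import numpy as np

def fold_cqt_to_pcp(cqt_mag):
    """Fold a CQT magnitude spectrogram (bins x frames, 12 bins per octave)
    into 12 pitch classes by summing octave-equivalent bins,
    then scaling each frame to unit maximum."""
    pcp = np.zeros((12, cqt_mag.shape[1]))
    for b in range(cqt_mag.shape[0]):
        pcp[b % 12] += cqt_mag[b]
    maxima = pcp.max(axis=0, keepdims=True)
    maxima[maxima == 0] = 1  # avoid dividing silent frames by zero
    return pcp / maxima
```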

Now, we will compute NTD on these three examples, and compare the resulting segmentations.

Firstly, we will load the annotation.

In [5]:

```
# Loading and formatting annotations
annotations = dm.get_segmentation_from_txt(annotation_path, annotations_type)
references_segments = np.array(annotations)[:,0:2]
```

In order to perform NTD on these examples, we need to transform these spectrograms into "Time-Frequency-Bar" tensors, *i.e.* tensors where time is divided into two scales:

- one for the inner time of bars,
- one representing each bar in the song.

(See the article "Uncovering Audio Patterns in Music with Nonnegative Tucker Decomposition for Structural Segmentation", soon to be published at the time of writing, for a detailed explanation of this tensor.)

In order to construct our tensor, we first need to estimate the downbeats. This is done via the madmom toolbox [4].

In [6]:

```
# Estimate the downbeats/bar frontiers
bars = dm.get_bars_from_audio(song_path)
# Convert the annotation in the bar scale: this is used for plotting the annotation on our figures.
annotations_frontiers_barwise = dm.frontiers_from_time_to_bar(np.array(annotations)[:,1], bars)
```
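A plausible sketch of such a time-to-bar conversion (not the actual `dm.frontiers_from_time_to_bar` code) maps each annotated frontier, in seconds, to the index of the bar whose start time is closest:

```python
import numpy as np

def frontiers_time_to_bar(frontiers_sec, bars):
    """Map annotated frontiers (in seconds) to bar indices.
    `bars` is a list of (start, end) times, one pair per bar."""
    bar_starts = np.array([start for start, _ in bars])
    return [int(np.argmin(np.abs(bar_starts - f))) for f in frontiers_sec]

# Toy example: three two-second bars, frontiers near the starts of bars 1 and 2.
print(frontiers_time_to_bar([1.9, 4.1], [(0, 2), (2, 4), (4, 6)]))  # [1, 2]
```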

We then cut each spectrogram at every downbeat, in order to obtain a collection of spectrogram slices, one per bar.

This collection of slices will then form the tensor. A difficulty, though, is that bars can have different lengths, resulting in slices of different sizes, whereas a tensor needs all slices (*i.e.* the per-bar spectrograms) to have the same size.

In these experiments, we decided to find the longest bar, use its length (in number of frames) as the common dimension for all bars, and zero-pad the bars that were shorter, as in [2].

**This method was then changed, see the 3rd notebook for details.**
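The zero-padding strategy described above can be sketched as follows (assumed behavior of `tf.tensorize_barwise`, with bar boundaries already converted to frame indices):

```python
import numpy as np

def tensorize_barwise_zeropad(spectrogram, bar_frame_indices):
    """Stack one spectrogram slice per bar into a (frequency, time, bar) tensor,
    zero-padding shorter bars up to the longest bar's frame count."""
    slices = [spectrogram[:, start:end] for start, end in bar_frame_indices]
    max_len = max(s.shape[1] for s in slices)
    tensor = np.zeros((spectrogram.shape[0], max_len, len(slices)))
    for k, s in enumerate(slices):
        tensor[:, :s.shape[1], k] = s  # remaining frames stay at zero
    return tensor
```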

In [9]:

```
stft_tensor_spectrogram = tf.tensorize_barwise(stft_spec, bars, hop_length_seconds)
cqt_tensor_spectrogram = tf.tensorize_barwise(cqt_spec, bars, hop_length_seconds)
pcp_tensor_spectrogram = tf.tensorize_barwise(pcp_spec, bars, hop_length_seconds)
```

In [10]:

```
# One particular slice of each tensor, representing a particular bar (the bar of index 48).
idx = 48
hide.slice_tensor_spectrogram(idx, stft_tensor_spectrogram,cqt_tensor_spectrogram,pcp_tensor_spectrogram)
```

*NB: note that these bars contain blank frames at the end. This is because, at the time of these experiments, bars were not resampled to hold the same number of frames; instead, all bars were set to the size of the largest bar, and smaller bars were padded with zero frames, as in [2].*

In [8]:

```
# Rank selection
ranks = [32,32,32]
```

Now, we decompose these tensors by NTD.

Firstly, we will plot the factor matrices resulting from this decomposition; secondly, we will plot the autosimilarity of the $Q$ (and normalized $Q$) matrix, along with the autosimilarity of the signal, for comparison.
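As a reminder of what these autosimilarity plots show, here is a minimal sketch (not the `musicntd` implementation): each row of $Q$ describes one bar, and the autosimilarity is the matrix of dot products between bars, optionally with rows normalized beforehand:

```python
import numpy as np

def autosimilarity(Q, normalize=True):
    """Bar-by-bar autosimilarity of a factor matrix Q (bars x rank).
    With normalize=True, rows are scaled to unit norm, so entries are
    cosine similarities between bars."""
    Q = np.asarray(Q, dtype=float)
    if normalize:
        norms = np.linalg.norm(Q, axis=1, keepdims=True)
        norms[norms == 0] = 1  # leave all-zero bars untouched
        Q = Q / norms
    return Q @ Q.T
```

Bars with similar activation patterns then appear as bright off-diagonal entries, which is what the segmentation methods exploit.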

In [12]:

```
stft_core, stft_factors = NTD.ntd(stft_tensor_spectrogram, ranks = ranks, init = "tucker", verbose = False, hals = False,
                                  sparsity_coefficients = [None, None, None, None], normalize = [True, True, False, True])
```

In [13]:

```
hide.nice_plot_factors(stft_factors)
```

In [14]:

```
hide.nice_plot_autosimilarities(stft_factors[2], stft_tensor_spectrogram, annotations_frontiers_barwise)
```

In [15]:

```
cqt_core, cqt_factors = NTD.ntd(cqt_tensor_spectrogram, ranks = ranks, init = "tucker", verbose = False, hals = False,
                                sparsity_coefficients = [None, None, None, None], normalize = [True, True, False, True])
```

In [16]:

```
hide.nice_plot_factors(cqt_factors)
```

In [17]:

```
hide.nice_plot_autosimilarities(cqt_factors[2], cqt_tensor_spectrogram, annotations_frontiers_barwise)
```

NB: the chroma dimension being 12, the rank of $W$ should not exceed 12, as a larger rank would probably create redundancy in the factors and be counterproductive to a salient decomposition. It is therefore fixed to 12.

In [18]:

```
pcp_core, pcp_factors = NTD.ntd(pcp_tensor_spectrogram, ranks = [12, 32, 32], init = "tucker", verbose = False, hals = False,
                                sparsity_coefficients = [None, None, None, None], normalize = [True, True, False, True])
```

In [19]:

```
hide.nice_plot_factors(pcp_factors)
```

In [20]:

```
hide.nice_plot_autosimilarities(pcp_factors[2], pcp_tensor_spectrogram, annotations_frontiers_barwise)
```