Evaluation

To evaluate both the performance of DECIBEL’s representation-specific subsystems and its final output chord sequence, we need suitable evaluation measures. The quality of a chord sequence is usually determined by comparing it to a ground truth created by one or more human annotators. Commonly used chord-annotation data sets, which are also used in the MIREX ACE contest, are Isophonics, Billboard, RobbieWilliams, RWC-Popular and USPOP2002Chords. DECIBEL uses the Isophonics data set, augmented with matched MIDI and tab files.

The standard measure for evaluating the quality of an automatic transcription is chord symbol recall (CSR). This measure is also used in MIREX. CSR is the summed duration of the time periods in which the correct chord has been identified, normalized by the total duration of the song. Until 2013, MIREX used an approximate, frame-based CSR, calculated by sampling both the ground-truth and the automatic annotations every 10 ms and dividing the number of correctly annotated samples by the total number of samples. Since 2013, MIREX has used a segment-based CSR, which is both more precise and computationally more efficient.
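To make the frame-based variant concrete, the following minimal Python sketch (illustrative only, not DECIBEL’s implementation) assumes each annotation is a list of (start_time, end_time, label) triples:

    def label_at(annotation, t):
        """Return the chord label sounding at time t ('N' = no chord)."""
        for start, end, label in annotation:
            if start <= t < end:
                return label
        return 'N'

    def frame_based_csr(ground_truth, estimate, step=0.01):
        """Sample both annotations every `step` seconds (10 ms by default)
        and return the fraction of samples on which the labels agree."""
        duration = max(end for _, end, _ in ground_truth)
        n_samples = int(duration / step)
        correct = sum(label_at(ground_truth, i * step) == label_at(estimate, i * step)
                      for i in range(n_samples))
        return correct / n_samples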

For results that are calculated over the whole data set, we weight each song’s CSR by the song’s duration when computing the average for a given corpus. This final number is referred to as the weighted chord symbol recall (WCSR). Calculating the WCSR is equivalent to treating the data set as one large audio file and calculating the CSR between the concatenation of all ground-truth annotations and the concatenation of all estimated annotations.
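In other words, the WCSR is a duration-weighted average of the per-song CSRs. A minimal sketch, assuming parallel lists of per-song CSR values and song durations:

    def weighted_csr(csr_per_song, duration_per_song):
        """Duration-weighted average of per-song CSR values (the WCSR)."""
        total_duration = sum(duration_per_song)
        return sum(csr * dur
                   for csr, dur in zip(csr_per_song, duration_per_song)) / total_duration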

The CSR correctly indicates the accuracy of an ACE algorithm in terms of whether the estimated chord at a given instant in the audio is correct. It is therefore widely used in the evaluation of ACE systems. However, the annotation with the highest CSR is not always the annotation that human listeners would consider best. For this reason, we also use measures based on the directional Hamming distance, which describe how fragmented a chord segmentation is with respect to the ground-truth segmentation.
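The sketch below shows one common formulation of these segmentation measures (following the usual MIREX convention, not necessarily DECIBEL’s exact implementation). Only segment boundaries matter here, so segmentations are represented as lists of (start, end) pairs:

    def directional_hamming(seg_a, seg_b):
        """For every segment in seg_b, the duration not covered by the single
        maximally-overlapping segment in seg_a, summed over all of seg_b."""
        dist = 0.0
        for b_start, b_end in seg_b:
            best_overlap = max((min(a_end, b_end) - max(a_start, b_start)
                                for a_start, a_end in seg_a), default=0.0)
            dist += (b_end - b_start) - max(best_overlap, 0.0)
        return dist

    def segmentation_score(ground_truth, estimate):
        """1 minus the worse of the two directional distances, normalized by
        song duration; 1.0 means the segment boundaries coincide exactly."""
        duration = max(end for _, end in ground_truth)
        worst = max(directional_hamming(estimate, ground_truth),
                    directional_hamming(ground_truth, estimate))
        return 1.0 - worst / duration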

decibel.evaluator.evaluator.evaluate(ground_truth_lab_path, my_lab_path)

Evaluate the chord label sequence in my_lab_path against the ground truth sequence in ground_truth_lab_path.

Parameters
  • ground_truth_lab_path – Path to .lab file of ground truth chord label sequence

  • my_lab_path – Path to .lab file of estimated chord label sequence

Returns

CSR, over-segmentation, under-segmentation, segmentation
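A hypothetical call (the file paths are placeholders; the four return values are assumed to come back in the order listed above):

    from decibel.evaluator import evaluator

    csr, over_seg, under_seg, seg = evaluator.evaluate(
        'data/ground_truth/song_001.lab',   # placeholder path
        'data/estimates/song_001.lab')      # placeholder path
    print('CSR: {0:.3f}, segmentation: {1:.3f}'.format(csr, seg))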

decibel.evaluator.evaluator.evaluate_method(all_songs, method_name, get_lab_function)

Evaluate all songs from our data set for one specific chord estimation technique, obtaining the estimated labels through get_lab_function.

Parameters
  • all_songs – All songs in our data set

  • method_name – Name of the method (e.g. ‘CHF_2017_DF_BEST’)

  • get_lab_function – A function that takes the song and outputs the lab path

Returns

Pandas DataFrame with results
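An illustrative call; the full_chord_labs_path attribute and the way all_songs is obtained are assumptions made for the example, not part of the documented API:

    from decibel.evaluator import evaluator

    def get_lab(song):
        # Assumed attribute; adapt to wherever your estimated labs live.
        return song.full_chord_labs_path

    # all_songs: the song collection used throughout DECIBEL (loaded elsewhere).
    results = evaluator.evaluate_method(all_songs, 'CHF_2017_DF_BEST', get_lab)
    print(results.head())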

decibel.evaluator.evaluator.evaluate_midis(all_songs) → None

Evaluate all lab files based on MIDI alignment and chord estimation.

Parameters

all_songs – All songs in the data set

decibel.evaluator.evaluator.evaluate_song_based(all_songs)

Evaluate all songs in the data set in parallel.

Parameters

all_songs – All songs in the data set

Returns

A print statement indicating that the evaluation has finished

decibel.evaluator.evaluator.evaluate_tabs(all_songs) → None

Evaluate all lab files based on tab parsing and alignment.

Parameters

all_songs – All songs in our data set.

decibel.evaluator.evaluator.write_method_evaluations(all_songs, method_name, get_lab_function)

Write evaluations for all songs from our data set that have not been evaluated yet.

Parameters
  • all_songs – All songs in our data set

  • method_name – Name of the method (e.g. ‘CHF_2017_DF_BEST’)

  • get_lab_function – A function that takes the song and outputs the lab path
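Putting the batch helpers together, a full evaluation run over the data set might look like this sketch (how all_songs is constructed depends on the rest of the DECIBEL pipeline):

    from decibel.evaluator import evaluator

    # all_songs: the collection of songs used throughout DECIBEL (loaded elsewhere).
    evaluator.evaluate_midis(all_songs)       # labs from MIDI alignment and chord estimation
    evaluator.evaluate_tabs(all_songs)        # labs from tab parsing and alignment
    evaluator.evaluate_song_based(all_songs)  # evaluate all songs in parallel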