Aligning MIDI to audio

To obtain audio-timed chord labels from a MIDI file, DECIBEL first finds an optimal alignment between the MIDI file and the audio file, realigns the MIDI file using this alignment, and then uses a MIDI chord recognizer to estimate the chord labels on the realigned MIDI file. The audio-MIDI aligner contains the methods that realign the MIDI file to the audio file.

For alignment between MIDI files and audio recordings, DECIBEL uses the DTW algorithm by Raffel and Ellis ([raffel2016optimizing]). Dynamic Time Warping (DTW) is a common technique for aligning two sequences of feature vectors, for example two representations of the same song.
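To make the idea of DTW concrete, here is a minimal sketch of the classic algorithm on two short one-dimensional sequences. It is a toy illustration only: the function name dtw_path and the absolute-difference cost are assumptions made for this example, and it omits the penalty and gully refinements that the Raffel and Ellis variant adds.

```python
from typing import List, Sequence, Tuple


def dtw_path(x: Sequence[float], y: Sequence[float]) -> Tuple[float, List[Tuple[int, int]]]:
    """Classic DTW between two 1-D sequences (illustrative, not DECIBEL's code)."""
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning x[:i+1] with y[:j+1]
    cost = [[inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            local = abs(x[i] - y[j])  # assumed local cost for this toy example
            if i == 0 and j == 0:
                cost[i][j] = local
                continue
            best_prev = min(
                cost[i - 1][j] if i > 0 else inf,
                cost[i][j - 1] if j > 0 else inf,
                cost[i - 1][j - 1] if i > 0 and j > 0 else inf,
            )
            cost[i][j] = local + best_prev

    # Walk back from the last cell to (0, 0), always taking the cheapest predecessor
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((cost[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((cost[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((cost[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return cost[n - 1][m - 1], path


total_cost, path = dtw_path([1, 2, 3], [1, 2, 2, 3])
print(total_cost, path)
```

In this toy input the second sequence holds the value 2 for one extra step, so the path pairs index 1 of the first sequence with both index 1 and index 2 of the second.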

Let us have a look at the outline of the algorithm. First, all MIDI files are synthesized using the fluidsynth software synthesizer with the FluidR3_GM soundfont. Now we have a waveform representation for both the audio and the MIDI file, as shown below:

Audio waveform

../_images/Audio-waveform.png

Synthesized MIDI waveform

../_images/Synth-MIDI-waveform.png

Note that our example MIDI file starts with silence, while in the audio recording the music starts immediately. The MIDI file is also longer, because it repeats the chorus one more time than the audio recording does. Next, the Constant-Q transform (CQT) is calculated for both the audio and the synthesized MIDI waveform:

Audio CQT

../_images/Audio-cqt.png

Synthesized MIDI CQT

../_images/MIDI-cqt.png

Features are found by aggregation over the Constant-Q transform vectors. Then, the optimal path between the audio file and the synthesized MIDI is calculated using DTW. This results in an optimal path and the alignment confidence score:

Alignment path

../_images/Alignment.png

In this figure, we see that the alignment path does not start at the coordinate (0, 0), but a bit to the right: the silence at the start of the MIDI file is not mapped to any position in the audio file. The same goes for the end of the MIDI file, which is a superfluous repetition of the chorus. Finally, this alignment path is used to remap the MIDI file to the audio recording:

MIDI re-alignment

../_images/Alginment-mapping.png
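The remapping step can be sketched as linear interpolation along the alignment path. The snippet below is an assumption-laden illustration, not DECIBEL's implementation: the function name remap_time and the representation of the path as two parallel lists of times (in seconds) are hypothetical. MIDI times that fall outside the aligned region, such as the leading silence or the extra chorus in the example above, are reported as None.

```python
from bisect import bisect_right
from typing import Optional, Sequence


def remap_time(t: float,
               midi_times: Sequence[float],
               audio_times: Sequence[float]) -> Optional[float]:
    """Map a MIDI time onto the audio timeline along an alignment path.

    midi_times and audio_times are the path coordinates in seconds; both are
    assumed to be non-decreasing and of equal length (hypothetical format).
    """
    if t < midi_times[0] or t > midi_times[-1]:
        return None  # outside the aligned region, so there is no audio position
    # Index of the last path point whose MIDI time is <= t
    k = bisect_right(midi_times, t) - 1
    if k >= len(midi_times) - 1:
        return audio_times[-1]
    t0, t1 = midi_times[k], midi_times[k + 1]
    a0, a1 = audio_times[k], audio_times[k + 1]
    # bisect_right guarantees t1 > t0, so the division is safe
    return a0 + (t - t0) / (t1 - t0) * (a1 - a0)


# Toy path: the MIDI file starts with 2 seconds of silence that are not aligned
path_midi = [2.0, 3.0, 4.0]
path_audio = [0.0, 1.0, 3.0]
print([remap_time(t, path_midi, path_audio) for t in (1.0, 2.5, 3.5, 5.0)])
```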

DECIBEL uses the parameter settings reported by Raffel and Ellis ([raffel2016optimizing]) without changes:

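======================  ======================================
Parameter               Setting
======================  ======================================
Feature representation  log-magnitude Constant-Q transform
Time scale              every 46 milliseconds
Cost function           cosine distance
Penalty                 median distance of all pairs of frames
Gully                   0.96
Band path constraint    none
======================  ======================================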
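As an illustration of the cost function and penalty settings listed above, the following sketch computes a cosine distance matrix between two small sets of feature frames and takes the median of all pairwise distances. The helper names and the toy frames are hypothetical, and treating an all-zero frame as maximally distant is an assumption of this example. As we read the paper, the penalty is added to the cost of non-diagonal moves in the DTW; that step is not shown here.

```python
import math
from statistics import median
from typing import List, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine distance (1 minus cosine similarity) between two feature frames."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # assumption: an all-zero (silent) frame matches nothing
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(midi_frames: Sequence[Sequence[float]],
                    audio_frames: Sequence[Sequence[float]]) -> List[List[float]]:
    """Distance between every MIDI frame (rows) and every audio frame (columns)."""
    return [[cosine_distance(m, a) for a in audio_frames] for m in midi_frames]


def median_penalty(distances: Sequence[Sequence[float]]) -> float:
    """Median distance over all pairs of frames."""
    return median(d for row in distances for d in row)


# Toy two-dimensional frames standing in for CQT feature vectors
midi_frames = [[1.0, 0.0], [0.0, 1.0]]
audio_frames = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
distances = distance_matrix(midi_frames, audio_frames)
print(distances, median_penalty(distances))
```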

Synthesize MIDI files

The decibel.audio_midi_aligner.synthesizer module contains functions for synthesizing MIDI files using the fluidsynth software synthesizer.

decibel.audio_midi_aligner.synthesizer.synthesize_midi_to_wav(midi_file_path_from: str, sampling_rate: int = 22050)

Converts a MIDI file, specified by its path, to a waveform and writes the result as a WAV file.

Parameters
  • midi_file_path_from – Path to the midi file which will be converted

  • sampling_rate – Sampling rate of the audio
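For readers who want to reproduce the synthesis step outside DECIBEL, the sketch below builds a command line for the fluidsynth executable. This is an assumption about one possible route, not a description of how synthesize_midi_to_wav works internally; the helper name and all file paths are hypothetical. The code only constructs the argument list, so it does not require fluidsynth to be installed.

```python
from typing import List


def build_fluidsynth_command(midi_path: str,
                             wav_path: str,
                             soundfont_path: str,
                             sampling_rate: int = 22050) -> List[str]:
    """Build a fluidsynth call that renders a MIDI file to a WAV file."""
    return [
        "fluidsynth",
        "-ni",                     # no MIDI input driver, no interactive shell
        "-F", wav_path,            # render to this file instead of playing live
        "-r", str(sampling_rate),  # sample rate of the rendered audio
        soundfont_path,            # for example a FluidR3_GM soundfont file
        midi_path,
    ]


command = build_fluidsynth_command("song.mid", "song.wav", "FluidR3_GM.sf2")
print(" ".join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) on a machine where fluidsynth and the soundfont are available.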

Aligning synthesized MIDI to audio

The decibel.audio_midi_aligner.aligner module contains functions for finding the alignment between the synthesized MIDI file and the audio recording. The align_midi() function (in the Alignment class) finds the alignment between a single MIDI file and the matched audio recording and returns an Alignment object. The function align_single_song() finds the alignment between the audio file of the given song and all matched MIDI files. Since this can take a long time, the process can be interrupted and resumed: the alignment of every MIDI file that has already been processed is stored and can be reloaded quickly in a later program run.

decibel.audio_midi_aligner.aligner.align_midi(audio_cqt: numpy.ndarray, audio_times: numpy.ndarray, full_synthesized_midi_path: str, full_alignment_write_path: str, alignment_parameters: Optional[decibel.audio_midi_aligner.alignment_parameters.AlignmentParameters] = None)

Align audio (specified by its CQT) to synthesized MIDI (specified by its path) and return the path and score of the alignment.

Parameters
  • alignment_parameters – Parameters for alignment

  • audio_cqt – The CQT of the audio of the alignment

  • audio_times – Array of times of the audio (from compute_cqt function)

  • full_synthesized_midi_path – The path to the synthesized MIDI file

  • full_alignment_write_path – The path to write the alignment to

decibel.audio_midi_aligner.aligner.align_single_song(song: decibel.music_objects.song.Song, alignment_parameters: Optional[decibel.audio_midi_aligner.alignment_parameters.AlignmentParameters] = None)

Align each MIDI file that is matched to this song to the song. As part of the procedure, each MIDI will be synthesized and the alignment of each MIDI will be written to a file.

Parameters
  • alignment_parameters – Parameters for alignment

  • song – The Song object for which we align each MIDI file
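The interrupt-and-resume behaviour described for align_single_song() follows a common caching pattern, sketched below under stated assumptions. The function align_all_with_cache, the JSON file layout, and the align_one callback are all hypothetical and chosen for illustration; DECIBEL's own storage format may differ.

```python
import json
import os
from typing import Callable, Dict, Iterable


def align_all_with_cache(midi_names: Iterable[str],
                         cache_dir: str,
                         align_one: Callable[[str], dict]) -> Dict[str, dict]:
    """Align each MIDI file once and reuse stored results on later runs."""
    results = {}
    for name in midi_names:
        cache_path = os.path.join(cache_dir, name + ".json")
        if os.path.isfile(cache_path):
            # Already aligned in an earlier (possibly interrupted) run
            with open(cache_path) as cache_file:
                results[name] = json.load(cache_file)
            continue
        result = align_one(name)  # the expensive alignment step
        # Write to a temporary file first, then rename it, so that an
        # interruption cannot leave a half-written cache entry behind
        tmp_path = cache_path + ".tmp"
        with open(tmp_path, "w") as tmp_file:
            json.dump(result, tmp_file)
        os.replace(tmp_path, cache_path)
        results[name] = result
    return results
```

Calling this function a second time with the same cache directory only invokes align_one for MIDI files that have no stored result yet.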

[raffel2016optimizing] Raffel, Colin, and Daniel P. W. Ellis. “Optimizing DTW-based audio-to-MIDI alignment and matching.” 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2016.