Described below are systems and methods for transcribing piano music. Particular embodiments may employ an efficient algorithm for convolutional sparse coding to transcribe the piano music.
Automatic music transcription (AMT) is the process of automatically inferring a high-level symbolic representation, such as music notation or piano-roll, from a music performance [1]. It has several applications in music education (e.g., providing feedback to a piano learner), content-based music search (e.g., searching for songs with a similar bassline), musicological analysis of non-notated music (e.g., jazz improvisations and non-Western music), and music enjoyment (e.g., visualizing the music content).
Music transcription of polyphonic music is a challenging task even for humans. It is related to ear training, a required course for professional musicians on identifying pitches, intervals, chords, melodies, rhythms, and instruments of music solely by hearing. AMT for polyphonic music was first proposed in 1977 by Moorer [2], and Piszczalski and Galler [3]. Despite almost four decades of active research, it is still an open problem and current AMT systems and methods cannot match human performance in either accuracy or robustness [1].
A core problem of music transcription is figuring out which notes are played and when they are played in a piece of music. This is also called note-level transcription [4]. A note produced by a pitched musical instrument has three basic attributes: pitch, onset, and offset. Pitch is a perceptual attribute but can be reliably related to the fundamental frequency (F0) of a harmonic or quasi-harmonic sound [5]. Onset refers to the beginning time of a note, during which the amplitude of the note increases from zero to an audible level. This increase is very sharp for percussive pitched instruments such as piano. Offset refers to the ending time of a note, when the waveform of the note vanishes.
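As an illustration of the three note attributes, a note could be represented as in the following minimal sketch. It assumes MIDI note numbers, twelve-tone equal temperament, and a tuning of A4 = 440 Hz, none of which is specified above; the class and field names are hypothetical.

```python
from dataclasses import dataclass

A4_MIDI = 69     # MIDI number of A4 (assumed pitch encoding)
A4_HZ = 440.0    # assumed concert-pitch tuning


@dataclass(frozen=True)
class Note:
    """One transcribed note: pitch, onset, and offset (hypothetical layout)."""
    pitch: int     # MIDI note number
    onset: float   # start time in seconds
    offset: float  # end time in seconds

    def f0(self) -> float:
        """Fundamental frequency in Hz under equal temperament."""
        return A4_HZ * 2.0 ** ((self.pitch - A4_MIDI) / 12.0)

    def duration(self) -> float:
        """Length of the note in seconds."""
        return self.offset - self.onset


middle_c = Note(pitch=60, onset=0.5, offset=1.25)
print(middle_c.f0(), middle_c.duration())
```

A note-level transcription is then simply a list of such records, one per played note.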
In the literature, the problems of pitch estimation and onset detection are often addressed separately and then combined to achieve note-level transcription. For onset detection, commonly used methods are based on spectral energy changes in successive frames [6]. These methods do not model the harmonic relation of the frequencies that exhibit this change, or the temporal evolution of the partial energy of notes. Therefore, these methods tend to miss onsets of soft notes in polyphonic pieces and detect false positives due to local partial amplitude fluctuations caused by overlapping harmonics or reverberation.
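A detector of the kind described above can be sketched as half-wave-rectified spectral flux followed by peak picking. This is a generic illustration, not the method of any cited work; the function names, the fixed threshold, and the toy magnitude frames are assumptions.

```python
def spectral_flux(frames):
    """Sum of positive magnitude increases between successive frames.

    `frames` is a list of magnitude spectra (one list of bin values per frame).
    All bins are treated alike: no harmonic relation between them is modeled.
    """
    flux = [0.0]  # no previous frame for the first entry
    for prev, cur in zip(frames, frames[1:]):
        flux.append(sum(max(c - p, 0.0) for p, c in zip(prev, cur)))
    return flux


def pick_onsets(flux, threshold):
    """Frame indices where flux is a local maximum above a fixed threshold."""
    onsets = []
    for i in range(1, len(flux)):
        left = flux[i - 1]
        right = flux[i + 1] if i + 1 < len(flux) else 0.0
        if flux[i] > threshold and flux[i] > left and flux[i] >= right:
            onsets.append(i)
    return onsets


# Toy input (assumed): a loud note starts at frame 2, a second note at frame 4.
frames = [
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [1.0, 0.5, 0.0],
    [0.8, 0.4, 0.0],
    [0.8, 0.4, 0.9],
    [0.6, 0.3, 0.7],
]
flux = spectral_flux(frames)
onsets = pick_onsets(flux, threshold=0.5)
print(onsets)
```

Because the flux is a single sum over all bins, a soft note whose energy increase stays under the threshold is missed, and any local fluctuation of a partial that exceeds it is reported as an onset.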
Pitch estimation in monophonic music is considered a solved problem [7]. In contrast, polyphonic pitch estimation is much more challenging because of the complex interaction (e.g., overlapping harmonics) of multiple simultaneous notes. To properly identify all the concurrent pitches, the partials of the mixture must be separated and grouped into clusters belonging to different notes. Most multi-pitch analysis methods operate in the frequency domain with a time-frequency magnitude representation [1]. This approach has two fundamental limitations: it introduces the time-frequency resolution trade-off due to the Gabor limit [8], and it discards the phase, which may contain useful cues for the harmonic fusing of partials [5]. These two limitations generally lead to low accuracy for practical purposes, with state-of-the-art results below 70% as evaluated by MIREX 2015 on orchestral pieces with up to five instruments and piano pieces.
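The resolution trade-off can be made concrete with a short worked example. The sampling rate and window length below are illustrative assumptions, not values taken from the methods cited above; the note frequencies assume equal temperament with A4 = 440 Hz.

```latex
% Gabor limit: the spreads of a signal in time and frequency satisfy
\sigma_t \, \sigma_f \ge \frac{1}{4\pi}

% For a Fourier analysis window of N samples at sampling rate f_s:
\Delta f = \frac{f_s}{N}, \qquad \Delta t = \frac{N}{f_s}, \qquad \Delta f \, \Delta t = 1

% Assumed example: f_s = 44100 Hz, N = 4096
\Delta f \approx 10.8\ \text{Hz}, \qquad \Delta t \approx 92.9\ \text{ms}

% Spacing of the two lowest piano fundamentals (A0 = 27.5 Hz and A#0)
27.5 \left( 2^{1/12} - 1 \right) \approx 1.64\ \text{Hz} < \Delta f
```

Under these assumptions the bin spacing is several times wider than the spacing of the lowest piano fundamentals, while a longer window would narrow the bins only at the cost of coarser timing of note onsets.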
There are, in general, three approaches to note-level music transcription. Frame-based approaches estimate pitches in each individual time frame and then form notes in a postprocessing stage. Onset-based approaches first detect onsets and then estimate pitches within each inter-onset interval. Note-based approaches estimate notes including pitches and onsets directly.
A. Frame-Based Approach
Frame-level multi-pitch estimation (MPE) is the key component of this approach. The majority of recently proposed MPE methods operate in the frequency domain. One group of methods analyzes or classifies features extracted from the time-frequency representation of the audio input [1]. Raphael [10] used a Hidden Markov Model (HMM) in which the states represent note combinations and the observations are spectral features, such as energy, spectral flux, and mean and variance of each frequency band. Klapuri [11] used an iterative spectral subtraction approach to estimate a predominant pitch and subtract its harmonics from the mixture in each iteration. Yeh et al. [12] jointly estimated pitches based on three physical principles: harmonicity, spectral smoothness, and synchronous amplitude evolution. More recently, Dressler [13] used a multi-resolution Short Time Fourier Transform (STFT) in which the magnitude of each bin is multiplied by the bin's instantaneous frequency. The pitch estimation is done by detecting peaks in the weighted spectrum and scoring them by harmonicity, spectral smoothness, presence of intermediate peaks, and harmonic number. Poliner and Ellis [14] used support vector machines (SVM) to classify the presence of notes from the audio spectrum. Pertusa and Iñesta [15] identified pitch candidates from spectral analysis of each frame, then selected the best combinations by applying a set of rules based on harmonic amplitudes and spectral smoothness. Saito et al. [16] applied a specmurt analysis by assuming a common harmonic structure of all the notes in each frame. Finally, methods based on deep neural networks are beginning to appear [17]-[20].
Another group of MPE methods is based on statistical frameworks. Goto [21] viewed the mixture spectrum as a probability distribution and modeled it with a mixture of tied-Gaussian mixture models. Duan et al. [22] and Emiya et al. [23] proposed Maximum-Likelihood (ML) approaches to model spectral peaks and non-peak regions of the spectrum. Peeling and Godsill [24] used non-homogeneous Poisson processes to model the number of partials in the spectrum.
A popular group of MPE methods in recent years is based on spectrogram factorization techniques, such as Nonnegative Matrix Factorization (NMF) [25] or Probabilistic Latent Component Analysis (PLCA) [26]; the two methods are mathematically equivalent when the approximation is measured by Kullback-Leibler (KL) divergence. The first application of spectrogram factorization techniques to AMT was performed by Smaragdis and Brown [27]. Since then, many extensions and improvements have been proposed. Grindlay et al. [28] used the notion of eigeninstruments to model spectral templates as a linear combination of basic instrument models. Benetos et al. [29] extended PLCA by incorporating shifting across log-frequency to account for vibrato, i.e., frequency modulation. Abdallah et al. [30] imposed sparsity on the activation weights. O'Hanlon et al. [31], [32] used structured sparsity, also called group sparsity, to enforce harmonicity of the spectral bases.
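Spectrogram factorization with the KL divergence can be sketched with the standard multiplicative update rules. The following is a minimal pure-Python illustration, not the implementation of any cited method; the function names, the random initialization, the iteration count, and the toy "spectrogram" are assumptions.

```python
import math
import random

EPS = 1e-12  # guards against division by zero


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def transpose(A):
    return [list(row) for row in zip(*A)]


def kl_divergence(V, R):
    """Generalized KL divergence D(V || R) between nonnegative matrices."""
    total = 0.0
    for v_row, r_row in zip(V, R):
        for v, r in zip(v_row, r_row):
            if v > 0:
                total += v * math.log(v / (r + EPS))
            total += r - v
    return total


def ratio(V, R):
    """Element-wise V / R."""
    return [[v / (r + EPS) for v, r in zip(vr, rr)] for vr, rr in zip(V, R)]


def nmf_kl(V, rank, n_iter=200, seed=0):
    """Approximate V (bins x frames) as W (templates) times H (activations)."""
    rng = random.Random(seed)
    n_bins, n_frames = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n_bins)]
    H = [[rng.random() + 0.1 for _ in range(n_frames)] for _ in range(rank)]
    for _ in range(n_iter):
        # Activation update: H <- H * (W^T (V / WH)) / (column sums of W)
        num = matmul(transpose(W), ratio(V, matmul(W, H)))
        w_sums = [sum(col) + EPS for col in zip(*W)]
        H = [[H[k][t] * num[k][t] / w_sums[k] for t in range(n_frames)]
             for k in range(rank)]
        # Template update: W <- W * ((V / WH) H^T) / (row sums of H)
        num = matmul(ratio(V, matmul(W, H)), transpose(H))
        h_sums = [sum(row) + EPS for row in H]
        W = [[W[f][k] * num[f][k] / h_sums[k] for k in range(rank)]
             for f in range(n_bins)]
    return W, H


# Toy input (assumed): two spectral templates, active alone and then together.
V = [[1.0, 0.0, 1.0],
     [2.0, 0.0, 2.0],
     [0.0, 3.0, 3.0],
     [0.0, 1.0, 1.0]]
W, H = nmf_kl(V, rank=2)
print(kl_divergence(V, matmul(W, H)))
```

In a transcription setting, each column of `W` would play the role of a note's spectral template and each row of `H` its activation over time; the extensions cited above add constraints such as sparsity or harmonicity to this basic scheme.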
Time domain methods are far less common than frequency domain methods for multi-pitch estimation. Early AMT methods operating in the time domain attempted to simulate the human auditory system with bandpass filters and autocorrelations [33], [34]. More recently, other researchers proposed time-domain probabilistic approaches based on Bayesian models [35]-[37]. Plumbley et al. [38] proposed and compared two approaches for sparse decomposition of polyphonic music, one in the time domain and the other in the frequency domain. However, a complete transcription system was not demonstrated because atoms had to be annotated manually, and the system was evaluated only on very short music excerpts, possibly because of the high computational requirements. Bello et al. [39] proposed a hybrid approach exploiting both frequency and time-domain information. More recently, Su and Yang [40] also combined information from spectral (harmonic series) and temporal (subharmonic series) representations.
To obtain a note-level transcription from frame-level pitch estimates, a post-processing step, such as a median filter [40] or an HMM [41], is often employed to connect pitch estimates across time into notes and remove isolated spurious pitches. These operations are performed on each note independently. To consider interactions of simultaneous notes, Duan and Temperley [42] proposed a maximum likelihood sampling approach to further refine the preliminary note tracking results.
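The median-filter variant of this post-processing step can be sketched as follows for a single pitch. The function names, the filter width, the hop size, and the toy activation sequence are assumptions made for illustration.

```python
def median_filter_binary(track, width=3):
    """Odd-width median filter on a 0/1 sequence, zero-padded at the edges.

    For binary data the median is 1 exactly when more than half of the
    window is 1.
    """
    half = width // 2
    padded = [0] * half + list(track) + [0] * half
    return [1 if sum(padded[i:i + width]) > half else 0
            for i in range(len(track))]


def frames_to_notes(track, pitch, hop_seconds):
    """Group consecutive active frames into (pitch, onset, offset) tuples."""
    notes = []
    start = None
    for i, active in enumerate(list(track) + [0]):  # sentinel closes last note
        if active and start is None:
            start = i
        elif not active and start is not None:
            notes.append((pitch, start * hop_seconds, i * hop_seconds))
            start = None
    return notes


# Toy frame-level estimates for one pitch (assumed): two isolated spurious
# detections (frames 2 and 15) and a one-frame gap (frame 8) inside a note.
raw = [0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0]
smoothed = median_filter_binary(raw, width=3)
notes = frames_to_notes(smoothed, pitch=60, hop_seconds=0.01)
print(smoothed)
print(notes)
```

Running the filter separately on each pitch row of a piano-roll reflects the per-note independence noted above: nothing in this step accounts for other notes sounding at the same time.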
B. Onset-Based Approach
In onset-based approaches, a separate onset detection stage is used during the transcription process. This approach is often adopted for transcribing piano music, given the relative prominence of onsets compared to other types of instruments. SONIC, a piano music transcription system by Marolt et al., used an onset detection stage to refine the results of neural network classifiers [43]. Costantini et al. [44] proposed a piano music transcription method with an initial onset detection stage to detect note onsets; a single CQT window of the 64 ms following the note attack is used to estimate the pitches with a multi-class SVM classification. Cogliati and Duan [45] proposed a piano music transcription method with an initial onset detection stage followed by a greedy search algorithm to estimate the pitches between two successive onsets. This method models the entire temporal evolution of piano notes.
C. Note-Based Approach
Note-based approaches combine the estimation of pitches and onsets (and possibly offsets) into a single framework. While this increases the complexity of the model, it has the benefit of integrating the pitch information and the onset information for both tasks. As an extension to Goto's statistical method [21], Kameoka et al. [46] used so-called harmonic temporal structured clustering to jointly estimate pitches, onsets, offsets and dynamics. Berg-Kirkpatrick et al. [47] combined an NMF-like approach in which each note is modeled by a spectral profile and an activation envelope with a two-state HMM to estimate play and rest states. Ewert et al. [48] modeled each note as a series of states, each state being a log-magnitude frame, and used a greedy algorithm to estimate the activations of the states.
As set forth above, despite almost four decades of active research, still further improvements may be desired.