The present invention relates to audio signal processing, and, in particular, to an apparatus and method for harmonic-percussive-residual sound separation using the structure tensor on spectrograms.
Separating a sound into its harmonic and percussive components is an effective preprocessing step for many applications.
While “Harmonic-Percussive(-Residual) Separation” is a common term, it is misleading, as it implies a harmonic structure with sinusoids whose frequencies are integer multiples of the fundamental frequency. Even though the correct term would be “Tonal-Percussive(-Residual) Separation”, the term “harmonic” is used instead of “tonal” in the following for easier understanding.
Using the separated percussive component of a music recording can, for example, lead to a quality improvement for beat tracking (see [1]), rhythm analysis and transcription of rhythm instruments. The separated harmonic component is suitable for the transcription of pitched instruments and chord detection (see [3]). Furthermore, harmonic-percussive separation can be used for remixing purposes such as changing the level ratio between both signal components (see [4]), which leads to either a “smoother” or a “punchier” overall sound perception.
Some methods for harmonic-percussive sound separation rely on the assumption that harmonic sounds have a horizontal structure in the magnitude spectrogram of the input signal (in time direction), while percussive sounds appear as vertical structures (in frequency direction). Ono et al. presented a method that first creates harmonically/percussively enhanced spectrograms by diffusion in time/frequency direction (see [5]). By comparing these enhanced representations afterwards, a decision whether a sound is harmonic or percussive can be derived.
A similar method was published by Fitzgerald, where the enhanced spectrograms were calculated by using median filtering in perpendicular directions instead of diffusion (see [6]), which led to similar results while reducing the computational complexity.
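The median-filtering idea described above can be illustrated with a minimal sketch. It is not the implementation of [6]; the function names (`median_filter_1d`, `enhance`, `binary_masks`), the filter length, the shrinking window at the edges and the strict-comparison masking rule are all assumptions chosen for illustration. The spectrogram is assumed to be a list of rows indexed as `spec[frequency][time]`.

```python
from statistics import median


def median_filter_1d(values, length):
    """Median-filter a sequence; the window shrinks at both ends (assumption)."""
    half = length // 2
    n = len(values)
    return [median(values[max(0, i - half):min(n, i + half + 1)])
            for i in range(n)]


def enhance(spec, length=3):
    """Return (harmonic_enhanced, percussive_enhanced) for spec[freq][time]."""
    # Filtering each row along time suppresses short, vertical events.
    harmonic = [median_filter_1d(row, length) for row in spec]
    # Filtering each column along frequency suppresses narrow, horizontal lines.
    columns = [median_filter_1d(list(col), length) for col in zip(*spec)]
    percussive = [list(row) for row in zip(*columns)]  # transpose back
    return harmonic, percussive


def binary_masks(spec, length=3):
    """Compare the two enhanced spectrograms bin by bin (illustrative rule)."""
    harmonic, percussive = enhance(spec, length)
    h_mask = [[h > p for h, p in zip(h_row, p_row)]
              for h_row, p_row in zip(harmonic, percussive)]
    p_mask = [[p > h for h, p in zip(h_row, p_row)]
              for h_row, p_row in zip(harmonic, percussive)]
    return h_mask, p_mask
```

Applied to a toy spectrogram that contains one constant horizontal line and one vertical line, the row-wise filter keeps the horizontal line and the column-wise filter keeps the vertical one, so the two masks assign each line to a different component.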
Inspired by the sines+transients+noise (S+T+N) signal model (see [7], [8], [9]), a framework that aims to describe the respective signal components by means of a small set of parameters, Fitzgerald's method was then extended to harmonic-percussive-residual (HPR) separation in [10]. As audio signals often consist of sounds that are neither clearly harmonic nor percussive, this procedure captures these sounds in a third, residual component. While some of these residual signals clearly have an isotropic structure that is neither horizontal nor vertical (for example, noise), there exist sounds that do not have a clear horizontal structure but nevertheless carry tonal information and may be perceived as a harmonic part of a sound. Examples are frequency-modulated tones, as they can occur in recordings of violin playing or vocals, where they are said to have “vibrato”. Due to the strategy of recognizing either horizontal or vertical structures, the aforementioned methods are not always able to capture such sounds in their harmonic component.
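The three-way decision can be sketched for a single time-frequency bin as follows. The text above does not spell out the decision rule of [10], so the ratio test, the parameter `beta`, the guard value `eps` and the function name `hpr_label` are assumptions made for illustration only. The inputs `h` and `p` stand for the harmonically and percussively enhanced magnitudes of one bin.

```python
def hpr_label(h, p, beta=2.0, eps=1e-12):
    """Assign one bin to 'harmonic', 'percussive' or 'residual'.

    A bin is only called harmonic (percussive) when its horizontally
    (vertically) enhanced magnitude dominates the other by the factor
    beta; everything in between falls into the residual (assumed rule).
    """
    if h / (p + eps) > beta:
        return "harmonic"
    if p / (h + eps) >= beta:
        return "percussive"
    return "residual"
```

With `beta = 1` the residual class is nearly empty and the rule reduces to a two-way split; larger values of `beta` push more bins into the residual. Under such a rule, a vibrato tone whose spectral line is neither clearly horizontal nor clearly vertical tends to end up in the residual, which is the limitation described above.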
A harmonic-percussive separation procedure based on non-negative matrix factorization that is capable of capturing harmonic sounds with non-horizontal spectral structures in the harmonic component was proposed in [11]. However, it did not include a third, residual component.
Summarizing the above, recent methods rely on the observation that in a spectrogram representation, harmonic sounds lead to horizontal structures and percussive sounds lead to vertical structures. Furthermore, these methods associate structures that are neither horizontal nor vertical (i.e., non-harmonic, non-percussive sounds) with a residual category. However, this assumption does not hold for signals like frequency modulated tones that show fluctuating spectral structures, while nevertheless carrying tonal information.
The structure tensor, a tool used in image processing (see [12], [13]), is applied there to grey scale images for edge and corner detection (see [14]) or to estimate the orientation of an object. The structure tensor has already been used for preprocessing and feature extraction in audio processing (see [15], [16]).
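As a brief illustration of how a structure tensor yields a local orientation, the following sketch computes the tensor at one pixel of a small grey scale image stored as `img[row][column]`. It is a simplified sketch under stated assumptions, not the procedure of any of the cited works: gradients are taken by central differences, the usual Gaussian smoothing is replaced by a plain box average, and the function names and the particular anisotropy measure are illustrative choices. The pixel must lie at least `radius + 1` samples away from the image border.

```python
import math


def structure_tensor_at(img, row, col, radius=1):
    """Return the tensor entries (t11, t12, t22) at img[row][col]."""
    t11 = t12 = t22 = 0.0
    count = 0
    for r in range(row - radius, row + radius + 1):
        for c in range(col - radius, col + radius + 1):
            gx = (img[r][c + 1] - img[r][c - 1]) / 2.0  # change along columns
            gy = (img[r + 1][c] - img[r - 1][c]) / 2.0  # change along rows
            t11 += gx * gx
            t12 += gx * gy
            t22 += gy * gy
            count += 1
    # Box average over the window (stand-in for Gaussian smoothing).
    return t11 / count, t12 / count, t22 / count


def orientation_and_anisotropy(t11, t12, t22):
    """Return (structure angle in [0, pi), anisotropy in [0, 1])."""
    # Eigenvalues of the symmetric 2x2 matrix [[t11, t12], [t12, t22]].
    root = math.hypot(t11 - t22, 2.0 * t12)
    lam_max = 0.5 * (t11 + t22 + root)
    lam_min = 0.5 * (t11 + t22 - root)
    # Direction of strongest change (eigenvector of the larger eigenvalue).
    gradient_angle = 0.5 * math.atan2(2.0 * t12, t11 - t22)
    # The structure itself runs perpendicular to that direction.
    structure_angle = (gradient_angle + math.pi / 2.0) % math.pi
    total = lam_max + lam_min
    anisotropy = 0.0 if total == 0.0 else ((lam_max - lam_min) / total) ** 2
    return structure_angle, anisotropy
```

If the image is read as a spectrogram with rows as frequency bins and columns as time frames, a structure angle near 0 corresponds to a horizontal line, an angle near pi/2 to a vertical line, and intermediate angles to slanted structures such as those produced by frequency-modulated tones. An anisotropy near 0 indicates that no dominant orientation exists at that position.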