Embodiments according to the invention relate to audio signal processing systems and, more particularly, to an apparatus and a method for determining a plurality of local center-of-gravity frequencies of a spectrum of an audio signal, the plurality of local center-of-gravity frequencies comprising a multitude of local center-of-gravity frequencies.
There is an increasing demand for digital signal processing techniques that support extreme signal manipulations in order to fit pre-recorded audio signals, e.g. taken from a database, into a new musical context. To do so, high-level semantic signal properties such as pitch, musical key and scale mode need to be adapted. All these manipulations have in common that they aim at substantially altering the musical properties of the original audio material while preserving subjective sound quality as well as possible. In other words, these edits strongly change the musical content of the audio material but must nevertheless preserve the naturalness of the processed audio sample and thus maintain believability. This ideally necessitates signal processing methods that are broadly applicable to different classes of signals, including polyphonic mixed music content.
Therefore, a method for analysis, manipulation and synthesis of audio signals based on multiband modulation components has been proposed recently (see “S. Disch and B. Edler, “An amplitude- and frequency modulation vocoder for audio signal processing,” Proc. of the Int. Conf. on Digital Audio Effects (DAFx), 2008”, “S. Disch and B. Edler, “Multiband perceptual modulation analysis, processing and synthesis of audio signals,” Proc. of the IEEE-ICASSP, 2009”). The fundamental idea of this approach is to decompose polyphonic mixtures into components that are perceived as sonic entities anyway, and to further manipulate all signal elements that are contained in one component in a joint fashion. Additionally, a synthesis method has been introduced that renders a smooth and perceptually pleasant, yet (depending on the type of manipulation applied) drastically modified, output signal. If no manipulation whatsoever is applied to the components, the method has been shown to provide transparent or near-transparent subjective audio quality (see “S. Disch and B. Edler, “An amplitude- and frequency modulation vocoder for audio signal processing,” Proc. of the Int. Conf. on Digital Audio Effects (DAFx), 2008”) for many test signals.
An important step for a block-based polyphonic music manipulation, e.g. the multiband modulation decomposition, is the estimation of local centers of gravity (COG) (see “J. Anantharaman, A. Krishnamurthy, and L. Feth, “Intensity-weighted average of instantaneous frequency as a model for frequency discrimination,” J. Acoust. Soc. Am., vol. 94, pp. 723-729, 1993”, “Q. Xu, L. L. Feth, J. N. Anantharaman, and A. K. Krishnamurthy, “Bandwidth of spectral resolution for the “c-o-g” effect in vowel-like complex sounds,” Acoustical Society of America Journal, vol. 101, pp. 3149-+, May 1997”) in successive spectra over time. This document shows an iterative algorithm that can be used to determine a signal-adaptive spectral decomposition that is aligned with the local COGs of the signal.
The COG approach may be reminiscent of the classic time-frequency reassignment (t-f reassignment) method. For an extensive overview of this technique, the reader is referred to “A. Fulop and K. Fitz, “Algorithms for computing the time corrected instantaneous frequency (reassigned) spectrogram, with applications”, Journal of the Acoustical Society of America, vol. 119, pp. 360-371, 2006”. Basically, t-f reassignment alters the regular time-frequency grid of a conventional Short Time Fourier Transform (STFT) towards a time-corrected instantaneous frequency spectrogram, thereby revealing temporal and spectral accumulations of energy that are better localized than implied by the t-f resolution compromise inherent in the STFT spectrogram. Often, reassignment is used as an enhanced front-end for subsequent partial tracking (see “K. Fitz and L. Haken, “On the use of time-frequency reassignment in additive sound modeling”, Journal of the Audio Engineering Society, vol. 50(11), pp. 879-893, 2002”).
Other related publications aim at the estimation of multiple fundamental frequencies (see “A. Klapuri, Signal Processing Methods For the Automatic Transcription of Music, Ph.D. thesis, Tampere University of Technology, 2004”, “Chunghsin Yeh, Multiple fundamental frequency estimation of polyphonic recordings, Ph.D. thesis, École doctorale edité, Université de Paris, 2008”) by grouping spectral peaks which exhibit certain harmonic relations into separate sources. However, for complex music composed of many sources (like orchestral music), this approach has no reasonable chance of success.
In some applications, vocoders are used for signal manipulation. One class of vocoders is the phase vocoder. A tutorial on phase vocoders is the publication ““The Phase Vocoder: A tutorial”, Mark Dolson, Computer Music Journal, Volume 10, No. 4, pages 14 to 27, 1986”. An additional publication is ““New phase vocoder techniques for pitch-shifting, harmonizing and other exotic effects”, J. Laroche and M. Dolson, proceedings 1999, IEEE workshop on applications of signal processing to audio and acoustics, New Paltz, N.Y., Oct. 17 to 20, 1999, pages 91 to 94”.
FIGS. 17 and 18 illustrate different implementations and applications for a phase vocoder. FIG. 17 illustrates a filter bank implementation of a phase vocoder 1700, in which an audio signal is provided at an input 500, and where, at an output 510, a synthesized audio signal is obtained. Specifically, each channel of the filter bank illustrated in FIG. 17 comprises a band pass filter 501 and a subsequently connected oscillator 502. Output signals of all oscillators 502 from all channels are combined via a combiner 503, which is illustrated as an adder. At the output of the combiner 503, the output signal 510 is obtained.
Each filter 501 is implemented to provide, on the one hand, an amplitude signal A(t), and on the other hand, the frequency signal f(t). The amplitude signal and the frequency signal are time signals. The amplitude signal illustrates a development of the amplitude within a filter band over time and the frequency signal illustrates the development of the frequency of a filter output signal over time.
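For illustration, the synthesis side of the filter bank structure of FIG. 17 may be sketched as a bank of oscillators whose outputs are summed. The following Python sketch is not taken from the cited publications; the function name `oscillator_bank`, the array layout (one row per channel) and the use of NumPy are assumptions made for this example only.

```python
import numpy as np

def oscillator_bank(amps, freqs, fs):
    """Sum of oscillators driven by A(t) and f(t) (hypothetical helper).

    amps, freqs: arrays of shape (channels, samples); freqs in Hz.
    fs: sample rate in Hz.
    """
    amps = np.atleast_2d(np.asarray(amps, dtype=float))
    freqs = np.atleast_2d(np.asarray(freqs, dtype=float))
    # Integrate each channel's frequency over time to obtain its oscillator phase.
    phase = 2.0 * np.pi * np.cumsum(freqs, axis=1) / fs
    # The combiner is modeled as a plain adder over all channels.
    return np.sum(amps * np.cos(phase), axis=0)
```

With constant rows in `amps` and `freqs`, each oscillator produces a stationary sinusoid; time-varying rows yield the amplitude and frequency modulated components described above.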
A schematic implementation of a filter 501 is illustrated in FIG. 18. The incoming signal is routed into two parallel paths. In one path, the signal is multiplied by a sine wave with an amplitude of 1.0 and a frequency equal to the center frequency of the band pass filter as illustrated at 551. In the other path, the signal is multiplied by a cosine wave of the same amplitude and frequency as illustrated at 552. Thus, the two parallel paths are identical except for the phase of the multiplying waveform. Then, in each path, the result of the multiplication is fed into a low pass filter 553. The multiplication operation itself is also known as a simple ring modulation. Multiplying any signal by a sine (or cosine) wave of constant frequency has the effect of simultaneously shifting all the frequency components in the original signal by both plus and minus the frequency of the sine wave. If this result is now passed through an appropriate low pass filter, only the low frequency portion will remain. This sequence of operations is also known as heterodyning. This heterodyning is performed in each of the two parallel paths, but since one path heterodynes with a sine wave, while the other path uses a cosine wave, the resulting heterodyned signals in the two paths are out of phase by 90°. The upper low pass filter 553, therefore, provides a quadrature signal 554 and the lower filter 553 provides an in-phase signal. These two signals, which are also known as I and Q signals, are forwarded into a coordinate transformer 556 which generates a magnitude/phase representation from the rectangular representation.
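The two heterodyning paths and the subsequent coordinate transformation may be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of FIG. 18: the name `heterodyne_channel`, the moving-average low pass filter and its length `lp_len` are chosen here purely for brevity.

```python
import numpy as np

def heterodyne_channel(x, fs, fc, lp_len=101):
    """Magnitude and wrapped phase (radians) of one channel centered at fc."""
    x = np.asarray(x, dtype=float)
    t = np.arange(len(x)) / fs
    # Two parallel paths: multiplication by a sine and by a cosine wave.
    q_path = x * np.sin(2.0 * np.pi * fc * t)
    i_path = x * np.cos(2.0 * np.pi * fc * t)
    # Assumed low pass filter: a simple moving average that removes the
    # component shifted up by fc and keeps the one shifted down to baseband.
    h = np.ones(lp_len) / lp_len
    q = np.convolve(q_path, h, mode="same")
    i = np.convolve(i_path, h, mode="same")
    # Rectangular (I, Q) to magnitude/phase. The factor 2 compensates for
    # the halving of the amplitude caused by the multiplication.
    magnitude = 2.0 * np.hypot(i, q)
    phase = np.arctan2(-q, i)
    return magnitude, phase
```

For an input `0.5 * cos(2*pi*fc*t)` the returned magnitude settles close to 0.5 away from the signal edges, where the zero padding of the convolution distorts the result.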
The amplitude signal is output at 557 and corresponds to A(t) from FIG. 17. The phase signal is input into a phase unwrapper 558. At the output of element 558, the phase value is no longer confined to the range between 0 and 360° but increases in a linear way. This “unwrapped” phase value is input into a phase/frequency converter 559 which may, for example, be implemented as a phase-difference device which subtracts the phase at a preceding time instant from the phase at a current time instant in order to obtain the frequency value for the current time instant.
This frequency value is added to a constant frequency value fi of the filter channel i, in order to obtain a time-varying frequency value at an output 560.
The frequency value at the output 560 has a DC portion fi and a changing portion which is also known as the “frequency fluctuation”, by which a current frequency of the signal in the filter channel deviates from the mean frequency fi.
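The phase unwrapping, the phase-difference step and the addition of the channel center frequency may be sketched as follows. The function name `instantaneous_frequency` and the conversion to Hz via the sample rate `fs` are assumptions of this example; `numpy.unwrap` stands in for the phase unwrapper.

```python
import numpy as np

def instantaneous_frequency(wrapped_phase, fs, fc):
    """Time-varying frequency in Hz from a wrapped phase signal in radians."""
    # Remove the 2*pi jumps so that the phase evolves continuously.
    unwrapped = np.unwrap(np.asarray(wrapped_phase, dtype=float))
    # Phase difference between successive time instants, converted from
    # radians per sample to Hz. This is the frequency fluctuation.
    fluctuation = np.diff(unwrapped) * fs / (2.0 * np.pi)
    # Add the constant center frequency of the filter channel.
    return fc + fluctuation
```

The result has one sample fewer than the input because each value is formed from two successive phase values.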
Thus, the phase vocoder as illustrated in FIG. 17 and FIG. 18 provides a separation of spectral information and time information. The spectral information is comprised in the specific filter bank channel and in the frequency fi, and the time information is in the frequency fluctuation and in the magnitude over time.
Another description of the phase vocoder is the Fourier transform interpretation. It consists of a succession of overlapping Fourier transforms taken over finite-duration windows in time. In the Fourier transform interpretation, attention is focused on the magnitude and phase values for all of the different filter bands or frequency bins at a single point in time. While in the filter bank interpretation the re-synthesis can be seen as a classic example of additive synthesis with time-varying amplitude and frequency controls for each oscillator, the synthesis in the Fourier implementation is accomplished by converting back to real-and-imaginary form and overlap-adding the successive inverse Fourier transforms. In the Fourier interpretation, the number of filter bands in the phase vocoder is the number of points in the Fourier transform. Similarly, the equal spacing in frequency of the individual filters can be recognized as the fundamental feature of the Fourier transform. On the other hand, the shape of the filter pass bands, i.e., the steepness of the cutoff at the band edges, is determined by the shape of the window function which is applied prior to calculating the transform. For a particular characteristic shape, e.g., a Hamming window, the steepness of the filter cutoff increases in direct proportion to the duration of the window.
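The Fourier transform interpretation may be illustrated by a short analysis/synthesis pair based on overlapping windowed transforms. The sketch below is only one possible arrangement: the names `stft` and `istft`, the periodic Hann window, the transform length of 1024 and the 50% overlap are assumptions chosen so that the overlap-added windows sum to one.

```python
import numpy as np

def stft(x, n_fft=1024, hop=512):
    """Succession of overlapping windowed Fourier transforms."""
    x = np.asarray(x, dtype=float)
    # Periodic Hann window: copies shifted by n_fft/2 sum to exactly 1.
    win = 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(n_fft) / n_fft)
    starts = range(0, len(x) - n_fft + 1, hop)
    return np.array([np.fft.rfft(win * x[s:s + n_fft]) for s in starts])

def istft(frames, n_fft=1024, hop=512):
    """Overlap-add of the successive inverse Fourier transforms."""
    out = np.zeros(hop * (len(frames) - 1) + n_fft)
    for m, spec in enumerate(frames):
        out[m * hop:m * hop + n_fft] += np.fft.irfft(spec, n_fft)
    return out
```

The magnitude and phase values of each frame are available as `np.abs(frames)` and `np.angle(frames)`. Without modification, the overlap-add reproduces the input except for the first and last `hop` samples, which are covered by a single window only.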
It is useful to see that the two different interpretations of the phase vocoder analysis apply only to the implementation of the bank of band pass filters. The operation by which the outputs of these filters are expressed as time-varying amplitudes and frequencies is the same for both implementations. The basic goal of the phase vocoder is to separate temporal information from spectral information. The operative strategy is to divide the signal into a number of spectral bands and to characterize the time-varying signal in each band.
Two basic operations are particularly significant. These operations are time scaling and pitch transposition. It is possible to slow down a recorded sound simply by playing it back at a lower sample rate. This is analogous to playing a tape recording at a lower playback speed. However, this kind of simplistic time expansion simultaneously lowers the pitch by the same factor as the time expansion. Slowing down the temporal evolution of a sound without altering its pitch necessitates an explicit separation of temporal and spectral information. As noted above, this is precisely what the phase vocoder attempts to do. Stretching out the time-varying amplitude and frequency signals A(t) and f(t) of FIG. 17 does not change the frequency of the individual oscillators at all, but it does slow down the temporal evolution of the composite sound. The result is a time-expanded sound with the original pitch. In the Fourier transform view of time scaling, a sound is time-expanded by simply spacing the inverse FFTs further apart than the analysis FFTs. As a result, spectral changes occur more slowly in the synthesized sound than in the original, and the phase is rescaled by precisely the same factor by which the sound is being time-expanded.
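The Fourier transform view of time scaling may be sketched as follows: analysis frames are taken with a hop of `hop_a` samples, the per-bin phase advance is rescaled, and the inverse transforms are overlap-added with a larger hop. The function name `time_stretch`, the Hann window, the default sizes and the normalization by the summed squared window are assumptions of this sketch and are not taken from the cited tutorial.

```python
import numpy as np

def time_stretch(x, factor, n_fft=1024, hop_a=128):
    """Time-expand x by about `factor` (> 1 is slower) without changing pitch."""
    x = np.asarray(x, dtype=float)
    if len(x) < n_fft:
        raise ValueError("input must contain at least n_fft samples")
    hop_s = int(round(hop_a * factor))  # synthesis hop: frames spaced further apart
    win = 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(n_fft) / n_fft)
    omega = 2.0 * np.pi * np.arange(n_fft // 2 + 1) / n_fft  # bin center, rad/sample
    starts = range(0, len(x) - n_fft + 1, hop_a)
    out = np.zeros(hop_s * (len(starts) - 1) + n_fft)
    norm = np.zeros_like(out)
    prev_phase = None
    for m, s in enumerate(starts):
        spec = np.fft.rfft(win * x[s:s + n_fft])
        mag, phase = np.abs(spec), np.angle(spec)
        if prev_phase is None:
            synth_phase = phase
        else:
            # Deviation from the expected phase advance, wrapped to [-pi, pi).
            dphi = phase - prev_phase - omega * hop_a
            dphi = (dphi + np.pi) % (2.0 * np.pi) - np.pi
            inst_freq = omega + dphi / hop_a
            # Rescale the phase advance to the larger synthesis hop.
            synth_phase = synth_phase + inst_freq * hop_s
        prev_phase = phase
        frame = np.fft.irfft(mag * np.exp(1j * synth_phase), n_fft)
        out[m * hop_s:m * hop_s + n_fft] += win * frame
        norm[m * hop_s:m * hop_s + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)
```

The effective expansion factor is `hop_s / hop_a`, so the output length only approximates `factor` times the input length for short signals.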
The other application is pitch transposition. Since the phase vocoder can be used to change the temporal evolution of a sound without changing its pitch, it should also be possible to do the reverse, i.e., to change the pitch without changing the duration. This is done by time-scaling the sound by the desired pitch-change factor and then playing the resulting sound back at a sample rate modified by the same factor. For example, to raise the pitch by an octave, the sound is first time-expanded by a factor of 2 and the time-expanded sound is then played at twice the original sample rate.
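The playback step of the octave example may be illustrated in isolation. The sketch below takes an already time-expanded signal as input and emulates playback at a modified sample rate by re-reading the samples at a different step size. The name `play_at_rate` and the use of linear interpolation without any anti-aliasing filter are simplifying assumptions.

```python
import numpy as np

def play_at_rate(x, factor):
    """Emulate playback at `factor` times the original sample rate.

    The samples are re-read at positions 0, factor, 2*factor, ... using
    linear interpolation, so the result is meant to be played at the
    original rate. No anti-aliasing filter is applied in this sketch.
    """
    x = np.asarray(x, dtype=float)
    positions = np.arange(0, len(x) - 1, factor)
    return np.interp(positions, np.arange(len(x)), x)
```

Applied with a factor of 2 to a 440 Hz tone that was time-expanded to two seconds, this yields a one second tone at 880 Hz, i.e. the original duration at twice the pitch.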
An application of vocoders for processing audio signals is shown for example in “Sascha Disch, Bernd Edler: “An Amplitude- and Frequency-Modulation Vocoder for Audio Signal Processing”, Proceedings of the 11th International Conference on Digital Audio Effects (DAFx-08), Espoo, Finland, Sep. 1-4, 2008”. In this document, local center of gravity candidates are estimated by searching for positive-to-negative transitions in a center of gravity position function. For this, the center of gravity position function is calculated for each value of the spectrum (for example for each spectral amplitude value or each power density value) for each time block of the audio signal. In this context, block sizes of N=2^14 values at a 48 kHz sample frequency are mentioned. Because the position function has to be evaluated for every spectral value of every block, the computational effort for estimating the local center of gravity candidates is very high.
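The search for positive-to-negative transitions may be illustrated by the following sketch. It is not the procedure of the cited publication: the rectangular analysis range `half_width`, the function names `cog_position` and `cog_candidates`, and the linear interpolation of the crossing are assumptions made here. The per-bin loop also makes visible why evaluating the position function for every spectral value is costly.

```python
import numpy as np

def cog_position(power, half_width):
    """Signed offset (in bins) of the local center of gravity around each bin."""
    power = np.asarray(power, dtype=float)
    offsets = np.arange(-half_width, half_width + 1)
    padded = np.pad(power, half_width)
    pos = np.zeros(len(power))
    for k in range(len(power)):  # one evaluation per spectral value
        seg = padded[k:k + 2 * half_width + 1]
        total = seg.sum()
        if total > 0:
            pos[k] = np.dot(offsets, seg) / total
    return pos

def cog_candidates(pos):
    """Fractional bin positions of positive-to-negative transitions."""
    k = np.where((pos[:-1] > 0) & (pos[1:] <= 0))[0]
    # Linear interpolation of the zero crossing between bins k and k + 1.
    return k + pos[k] / (pos[k] - pos[k + 1])
```

A positive value means that the local center of gravity lies above the current bin and a negative value that it lies below, so a positive-to-negative transition marks a position towards which the neighboring bins point.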
Additionally, a post-selection procedure is needed to ensure that the final estimated center of gravity positions are approximately equidistant on a perceptual scale.